EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1972, 32, 3-22. 


AN EMPIRICAL INVESTIGATION OF SOME SPECIAL 
CASES OF THE GENERAL “ORTHOMAX” CRITERION
FOR ORTHOGONAL FACTOR TRANSFORMATION 


A. RALPH HAKSTIAN
University of Alberta 
WILLIAM M. BOYD¹
Memorial University of Newfoundland 


In this paper, the results of an investigation into some special 
cases of the general “orthomax” formulation are presented. In par- 
ticular, the effects of manipulating a parameter in this formulation 
on various aspects of factor solutions are identified through the use 
of four sets of data, varying considerably in size, and reliability and 
factorial complexity of the variables. The implications for practical 
purposes of the results are subsequently discussed. 

The appropriateness of orthogonal transformation has, in the 
authors’ opinion, been somewhat misrepresented over the years, by 
the large number of studies conducted with little methodological 
consideration, using the standard computing center package pro- 
viding either principal components or common-factors, rotated using 
the varimax technique. It is seldom true that anything approaching 
an optimal simple structure will result when orthogonality is im- 
posed upon a solution. This is largely because it is unlikely that, if 
allowed unconstrained expression, the important factors underlying 
a set of variables will turn out to be mutually orthogonal, although
if this condition does, in fact, obtain, a solution resulting from a
proven oblique transformation, such as provided by the methods of
Harris and Kaiser (1964), will reflect this. Orthogonality of factor
axes should most properly be considered a constraint imposed upon
a set of data for a particular purpose.

¹ The authors are pleased to acknowledge the assistance of Mr. Adonis E.
Labor and Mr. Ernest N. Skakun, in checking the computations.

The purpose may involve the development of an instrument or 
set of theoretical constructs in which the independence of the com- 
ponent parts is an important feature. Also, an orthogonal solution 
may be desired as an intermediate step from which to proceed to an 
ultimate oblique resolution; in the aforementioned Harris-Kaiser 
technique, for example, such a step permits oblique solutions with 
no risk of transforming to singularity. In the context of the general 
factor analytic study, however, an orthogonal solution—as the final 
result—will seldom permit the maximum clarity of factor interpre- 
tation for the data at hand, and it is only for the special purposes 
noted that this paper was written. 

The history of automatic or nonsubjective orthogonal transformation
has followed two major paths. The first, in the direction of
“blind” transformation, has had as its guideposts the quartimax
(Carroll, 1953; Ferguson, 1954; Neuhaus and Wrigley, 1954; Saunders,
1953), varimax (Kaiser, 1958), and equamax (Saunders, 1962)
criteria. The second, directed towards hypothesis confirmatory
transformation, generally referred to as the orthogonal Procrustes
problem, has been most thoroughly charted in the work of Schönemann
(1966a). A possible third path, having the same goal as the first but
crossing the second in places, is represented by the varisim technique
(Schönemann, 1966b). The work reported in this paper is
clearly an extension of the first alternative.

The three aforementioned analytic criteria in the “blind” approach
can be seen as special cases of a more general “orthomax”
criterion² (Harman, 1960; Harris and Kaiser, 1964):


\[
n \sum_{j=1}^{n} \sum_{p=1}^{m} b_{jp}^{4} \;-\; w \sum_{p=1}^{m} \Bigl( \sum_{j=1}^{n} b_{jp}^{2} \Bigr)^{2} = \text{maximum},
\]
where $b_{jp}$ is the loading of variable j on orthogonally transformed
factor p, n is the number of variables, and m, of factors. The parameter
w, regulating the weight given the second term, determines the
special case of the formulation, with a value of 0 giving the quartimax
criterion, 1 giving varimax, and m/2, equamax. In practice, a
normalized form of this criterion is generally used, with each variable
extended to unit length for the purposes of the maximization.
Recent papers by Crawford and Ferguson (1970) and Jennrich
(1970) contain extensions and generalizations of this criterion and
represent important analytical advances. The work reported in this
paper was purely empirical, on the other hand, and was directed at
assessing the effects of varying w widely on three aspects of the
final solution: variance dispersion, exemplification of simple
structure, and interpretation of the obtained factors.

² [...] appear to have originated with J. B. Carroll. See Harman [...].
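As a purely illustrative aid, the orthomax criterion described above can be evaluated numerically. The sketch below is not the authors' program; the function name `orthomax_criterion` and the small loading matrix are hypothetical, and the normalization is assumed to mean dividing each row by the square root of its communality.

```python
import numpy as np

def orthomax_criterion(B, w, normalize=True):
    """Return n * sum(b^4) - w * sum_p (sum_j b^2)^2 for loadings B (n x m)."""
    B = np.asarray(B, dtype=float)
    if normalize:
        # Assumed form of the normalization: extend each variable (row)
        # to unit length before evaluating the criterion.
        B = B / np.sqrt((B ** 2).sum(axis=1, keepdims=True))
    n = B.shape[0]
    sq = B ** 2
    return n * (sq ** 2).sum() - w * (sq.sum(axis=0) ** 2).sum()

# Hypothetical data: a clean two-factor simple structure and the same
# configuration rotated 45 degrees away from it.
simple = np.array([[.8, 0], [.8, 0], [.8, 0], [0, .8], [0, .8], [0, .8]])
c, s = np.cos(np.pi / 4), np.sin(np.pi / 4)
rotated = simple @ np.array([[c, -s], [s, c]])

for w in (0, 1):  # w = 0 is quartimax, w = 1 is varimax
    print(w, orthomax_criterion(simple, w), orthomax_criterion(rotated, w))
```

With these made-up loadings the simple-structure position scores higher than the rotated one for both values of w, which is the behavior the criterion is designed to reward.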


Method 


Comparing the Solutions 


For each set of data, the unrotated centroid matrix was transformed
to several orthogonal simple structure solutions, the w parameter
being varied between 0 or less than 0 and m or greater than m
(the exact values of w for each data set are given with the results).
In each case, the normalized, as opposed to raw, form of the orthomax
criterion was used. The obtained factors were then matched
with the factors of a graphically transformed solution, the latter
determining the positions of the columns of all obtained factor
matrices. The matching was accomplished by cross-correlating the
factors of each analytic solution with those of the graphic, using the
following rationale. Let A, of order n × m, be the matrix of unrotated
(centroid) factors, B, of order n × m, the final transformed
orthogonal solution, and T, of order m × m, the orthonormal
(T'T = TT' = I) transformation, such that B = AT. Further,
let subscripts a and g denote, respectively, analytic and graphic
solutions. Then


\[ B_a = A T_a, \qquad B_g = A T_g. \tag{1} \]

Now, since both $B_a$ and $B_g$ are the results of orthonormally
transforming the same initial matrix A, there exists an orthonormal
matrix K, of order m × m, that maps $B_a$ into $B_g$. Thus,

\[ B_g = B_a K, \quad \text{or, from (1),} \tag{2} \]
\[ A T_g = A T_a K, \quad \text{and} \tag{3} \]
\[ K = T_a' T_g. \tag{4} \]
Since both $B_a$ and $B_g$ are orthogonal solutions, element p, q
(p, q = 1, 2, ..., m) of K is the cosine of the angle between graphic
factor q and analytic factor p. Thus, $K = T_a' T_g$ is also the matrix of
correlations ($r_{pq} = \cos \theta_{pq}$) between the factors of the two solutions.
The analytic factors were matched with the graphic, for each data
set, by maximizing tr(K).
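The matching rule just described can be sketched in a few lines. This is an assumption-laden illustration rather than the authors' procedure: `match_factors` is a hypothetical name, the search simply tries every ordering of the analytic factors (practical only for small m), and reflections of factors are not considered.

```python
import itertools
import numpy as np

def match_factors(T_a, T_g):
    """Order analytic factors so that tr(K) is maximal, with K = T_a' T_g."""
    K = T_a.T @ T_g  # K[p, q]: cosine between analytic factor p and graphic factor q
    m = K.shape[0]
    best_perm, best_trace = None, -np.inf
    for perm in itertools.permutations(range(m)):
        # perm[q] is the analytic factor placed in the column of graphic factor q.
        trace = sum(K[perm[q], q] for q in range(m))
        if trace > best_trace:
            best_perm, best_trace = perm, trace
    return list(best_perm), best_trace, K

# Hypothetical check: the analytic transformation is the graphic one with
# its columns shuffled, so a perfect match (trace = m) must exist.
T_g = np.eye(3)
T_a = T_g[:, [2, 0, 1]]
perm, trace, K = match_factors(T_a, T_g)
print(perm, trace)
```

For a solution with many factors, such as thirteen, exhaustive search is impractical and an assignment algorithm would be the natural substitute; that substitution is a suggestion here, not something the paper describes.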
Once the factors of a given solution were arranged to correspond
to those of the graphic solution, the common-factor variance
accounted for by each factor in the solution was determined (Variance
[Factor p] = $\sum_j b_{jp}^2$; p = 1, 2, ..., m). Solutions were
compared by studying the particular allotments of variance to the
factors in each and the overall equalization of the variance among
the factors.
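The variance allotment and its equalization can be computed directly from a loading matrix. The following is a minimal sketch with hypothetical names and made-up loadings; it reports the column sums of squared loadings, their range, and their standard deviation as an alternative index of equalization.

```python
import numpy as np

def variance_dispersion(B):
    """Return (variance per factor, range, standard deviation of the variances)."""
    var = (np.asarray(B, dtype=float) ** 2).sum(axis=0)  # sum of squared loadings per column
    return var, var.max() - var.min(), var.std()

# Hypothetical two-variable, two-factor loading matrix.
B_demo = np.array([[.8, .1], [.6, .3]])
var, var_range, var_sd = variance_dispersion(B_demo)
print(var, var_range, var_sd)
```

Applied to the factor variances reported for the graphic solution of the eight physical variables (3.352 and 2.612), the range works out to .740, the value given in Table 1.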
Next, an attempt was made to assess the degree of simple structure
exemplified by a given solution by studying (1) the hyperplane-counts
(number of loadings, by factor and for the total solution,
in the range 0 ± .10) and (2) the previously mentioned
correlations of the obtained factors with those of the graphic
solution, converted to angular separations ($\theta_{pq} = \arccos r_{pq}$), and the
mean angular separation for a solution (over the m factors in the
solution). The assumption was thus made that a graphic solution
was likely to be the best manifestation of simple structure for a
data set.
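Both simple-structure indices are easy to compute. The sketch below uses hypothetical function names and made-up loadings; the hyperplane width of .10 and the arccosine conversion follow the description above, while rounding to the nearest minute is an assumption about how the tabled angles were produced.

```python
import numpy as np

def hyperplane_counts(B, width=0.10):
    """Count loadings with |b| <= width, by factor and in total."""
    counts = (np.abs(np.asarray(B, dtype=float)) <= width).sum(axis=0)
    return counts, int(counts.sum())

def angular_separation(r):
    """Convert a factor correlation r into (degrees, minutes), rounded to the minute."""
    total_minutes = int(round(float(np.degrees(np.arccos(r))) * 60))
    return divmod(total_minutes, 60)

# Hypothetical loadings: three variables on two factors.
B_demo = np.array([[.05, .50], [-.08, .30], [.40, .02]])
counts, total = hyperplane_counts(B_demo)
print(counts, total)

# Correlations taken from the tables: .99721 and .9990.
print(angular_separation(0.99721), angular_separation(0.9990))
```

The two correlations convert to 4° 17' and 2° 34', which agree with the angular separations tabled for the quartimax solution of the eight physical variables and for varimax Factor III of the twenty-four psychological tests.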
Finally, the interpretation of each factor for a given solution was
studied. A factor was interpreted in terms of the variables found to
load .30 or higher, in absolute value, on the factor. Solutions for
each data set were compared in terms of how each factor was
interpreted.
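The interpretive rule can be stated as a short routine. This is only a sketch: `salient_variables` is a hypothetical name, the loadings are made up, and the .30 cutoff is the one used in the tabulations of the results.

```python
import numpy as np

def salient_variables(B, cutoff=0.30):
    """For each factor, list the 1-based numbers of variables with |loading| >= cutoff."""
    B = np.asarray(B, dtype=float)
    return [(np.flatnonzero(np.abs(B[:, p]) >= cutoff) + 1).tolist()
            for p in range(B.shape[1])]

# Hypothetical loadings: four variables on two factors.
B_demo = np.array([[.68, .15], [.44, .11], [.05, -.61], [.23, .33]])
print(salient_variables(B_demo))
```

Because the rule depends only on the size of the loadings, a factor that receives more variance in one solution than in another will generally pick up more salient variables and so invite a broader interpretation.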


Data Used 


Four sets of data were used in this study, varying in (1) the
number of variables and factors and (2) the reliability and
factorial complexity of the variables:

(a) Eight physical variables (8 X 2). These anthropometric
variables are highly reliable, with a median communality of [...]
and a factorial complexity of one in all solutions. The centroid
matrix and graphic orthogonal solution were obtained from [...].

(b) Twenty-four psychological tests (24 X 4). These well-known
variables are of typical reliability for mental tests, with a median 
communality—an underestimate, of course, of reliability—of .47. 
Their factorial complexity—as determined for each variable from 
the number of loadings of .30 or larger, in absolute value, in the 
graphic solution (an imperfect index, of course, being somewhat 
dependent upon the communality and largely dictated by interpre- 
tive practices)—was moderately high, with 17 variables having 
complexity greater than one. The centroid matrix and graphic orthog- 
onal solution (obtained by a graphical method due to Zimmerman) 
were found in Harman (1960). 

(c) Wittenborn data (20 X 7). These variables, representing 
measures of attention, were analyzed by Wittenborn (1948), using 
a graphic orthogonal solution. The reliability of the variables is 
moderate (the median communality is .44) and they were fac- 
torially quite simple, only five of the 20 variables having complex- 
ity greater than one. 

(d) PMA data (57 X 13). These well-known variables were
first analyzed by Thurstone (1938). The graphic orthogonal solu- 
tion used in this study, however, was performed by Zimmerman 
(1953), and was accomplished by starting from the point at which 
Thurstone had stopped (not having rotated all 13 factors) and fin- 
ishing the rotational procedures, obtaining a clearer resolution of 
the factors than in the earlier study. The 57 variables are highly 
reliable, with a median communality of .71. They also tended to be 
factorially quite complex, with 45 having complexity—as assessed 
by the procedure just described—greater than one. This apparent 
complexity was a function, in part, of the large communalities. 


Results 
Eight Physical Variables


Four orthomax solutions were obtained for this data set, with w 
set to 0, 1, 2, and 4. The summary results are presented in Table 1. 
It can be seen from this table that the larger the value of w, the 
more evenly the variance was dispersed across the two factors, the 
quartimax solution (w = 0) showing the most unequal variance 
allotment.

On the two simple structure criteria, hyperplane-counts and
overall closeness to the graphic solution, discrimination among the
orthomax solutions was difficult. No solution, including the graphic,
had any entries on the hyperplane of either factor. The correlations
between the orthomax and graphic factors were taken to five places
of decimals to permit some discrimination. The quartimax solution
was considerably further (4° 17') from the graphic position than
were the other orthomax solutions, the latter being almost identical
to the graphic. Overall, it would seem that the solutions with w set to
1, 2, and 4 exemplified simple structure equally well, with the
quartimax solution perhaps slightly inferior on this criterion.


TABLE 1

Dispersion of Variance, Hyperplane-Counts, and Correlations and Angular
Separations with the Graphic Solution for the Orthomax Solutions
of the Eight Physical Variables

Variance Dispersion
                    Factor
Solution         I        II       Range
Graphic        3.352    2.612      .740
w = 0          3.556    2.411     1.145
    1          3.317    2.649      .668
    2          3.314    2.651      .663
    4          3.311    2.654      .657

Hyperplane-Counts
                    Factor
Solution         I        II       Total
Graphic          0        0         0
w = 0            0        0         0
    1            0        0         0
    2            0        0         0
    4            0        0         0

Correlations with Graphic Factors
                Graphic Factor
Solution         I         II
w = 0          .99721    .99721
    1          .99990    .99990
    2          .99990    .99990
    4          .99980    .99980

Angular Separations with Graphic Factors
                Graphic Factor
Solution         I         II       Mean
w = 0          4° 17'    4° 17'    4° 17'
    1          0° 49'    0° 49'    0° 49'
    2          0° 49'    0° 49'    0° 49'
    4          1° 9'     1° 9'     1° 9'

With this particularly simple data set, the four analytic solutions
and the graphic admitted to the same interpretation of the factors.
That is, Factor I would be interpreted in terms of variables 1, 2, 3,
and 4, and Factor II, in terms of variables 5, 6, 7, and 8.


Twenty-Four Psychological Tests 


Ten orthomax solutions were obtained for these data, with w set 
to -8, -2, 0, 1, 2, 4, 8, 24, 48, and 96. Summary results appear in
Table 2. The actual graphic, quartimax, and varimax solutions ap- 
pear in Table 3. Solutions obtained with w set to 2 (yielding an 
equamax solution) and 24 are presented in Table 4. From Table 2, 
it can be seen that, again, as w was increased, the variance dispersion 
tended to become increasingly more level, although with these solu- 


TABLE 2
Dispersion of Variance, Hyperplane-Counts, and Correlations and Angular
Separations with the Graphic Solution for the Orthomax Solutions
of the Twenty-Four Psychological Tests

Variance Dispersion
                          Factor
Solution        I        II       III      IV      Range
Graphic       3.240    2.570    3.272    2.374     .898
w = -8        1.525    1.315    7.563     .980    6.583
    -2        1.560    1.343    7.490     .990    6.500
     0        2.056    1.759    6.245    1.323    4.922
     1        3.504    2.441    3.082    2.356    1.148
     2        3.579    2.607    2.715    2.361    1.218
     4        3.586    2.830    2.605    2.362    1.224
     8        3.582    2.923    2.509    2.368    1.214
    24        3.523    3.036    2.404    2.420    1.119
    48        3.114    3.029    2.327    2.913     .787
    96        3.051    3.040    2.339    2.952     .712

Hyperplane-Counts
                          Factor
Solution        I        II       III      IV      Total
Graphic         6        7        3        2        18
w = -8          9        8        0        8        25
    -2         13       10        0        8        31
     0         12       12        0        9        33
     1          3        8        4        5        20
     2          3        5        3        5        16
     4          3        3        3        4        13
     8          3        2        3        4        12
    24          5        2        4        4        15
    48          7        2        4        6        19
    96          7        1        4        6        18

Correlations with Graphic Factors
                      Graphic Factor
Solution        I        II       III      IV
w = -8        .8149    .7560    .6319    .7164
    -2        .8236    .7663    .6660    .7137
     0        .8618    .8274    .8779    .7899
     1        .8968    .8358    .9990    .8374
     2        .8902    .8412    .9853    .8473
     4        .8858    .8429    .9809    .8542
     8        .8827    .8419    .9778    .8598
    24        .7687    .6166    .9288    .6865
    48        .5952    .7379    .6782    .7985
    96        .5805    .6919    .6360    .7744

Angular Separations with Graphic Factors
                      Graphic Factor
Solution        I        II       III      IV       Mean
w = -8        35°25'   40°53'   50°49'   44°15'   42°51'
    -2        34°33'   39°59'   48°14'   44°28'   41°49'
     0        30°29'   34°10'   28°37'   37°49'   32°46'
     1        26°16'   33°18'    2°34'   33°8'    23°49'
     2        27°6'    32°44'    9°50'   32°5'    25°26'
     4        27°43'   32°33'   11°13'   31°18'   25°42'
     8        28°2'    32°40'   12°6'    30°42'   25°53'
    24        40°13'   51°56'   21°45'   46°39'   40°8'
    48        53°28'   42°27'   47°18'   37°1'    45°4'
    96        57°58'   46°13'   50°30'   39°15'   48°29'


tions, the relationship was not perfect. The variance equalization of 
the graphic solution was exceeded by only two orthomax solutions— 
those with the highest values of w, 48 and 96. It is interesting to note 
that not only was there variability among the solutions in terms of 
equalization of variance over the factors, but the factor receiving 
the largest allotment of variance varied (Factor III for w = -8,
-2, and 0, and the graphic solution; Factor I for w = 1, 2, 4, 8, 24,
48, and 96) as did that accounting for the least variance (Factor
IV for w = -8, -2, 0, 1, 2, 4, and 8, and the graphic solution; Factor
III for w = 24, 48, and 96).

The hyperplane-counts presented would seem to have little
correspondence with simple structure for the solutions with w = -8,
-2, and 0, since, in these solutions, large counts, as one might
expect, were recorded for factors accounting for very small amounts
of variance. For the solutions with fairly equitable variance
dispersion, however, the varimax solution (w = 1) had the largest
TABLE 3
Orthogonally Rotated Solutions, Using the Graphic, Quartimax (w = 0), and
Varimax (w = 1) Techniques, for the Twenty-Four Psychological Tests
(Leading Decimal Points Omitted)


Graphic Factor Quartimax Factor Varimax Factor
I II III IV I II III IV I II III IV


1 14 16 68 15 —10 03 72-07 14 19 67 17 
2 09 06 44 11 —05—03 46 —05 10 07 43 10 
3 11 —02 54 14 —03 —10 65 —10 15 02 54 08 
4 18 02 5 13 01—04 68-м 20 09 54 07 
5 78 05 23 33 62 14 51 03 75 21 22 18 
6 65 —0 23 ц 62 02 52 M 7 10 239 21 
7 77—03 22 83. 70 09 60 01 8 16 21 08 
8 55 12 40 23 38 15 69 —03 54 26 88 12 
9 вв —07 21 52 67-07 52 15 80 01 22 25 
10 зв 67 00-01 11 69 21 21 15 70-06 24 
11 31 63 13 14 08 66 36 28 17 60 08 88 
12 25 61 28 —15 —09 62 38 —1 02 69 2% 11 
13 37 45 46 —09 01 48 66 =11 18 59 И 06 
14 15 $9 05 4 14 13 29 45 22 16 04 60 
15 02 27 14 44 зи 12 07 14 60 
16 00 24 41 38 —09 —00 54 27 08 10 И 48 
17 05 48 07 5 05 15 33 57 м 18 06 64 
18 —03 48 38 37 —16 18 60 40 00 26 88 54 
19 08 27 24 33 OL 08 40 29 18 15 24 $9 
20 28 10 47 33 16—02 62 08 $ M 47 25 
21 22 88 45 15 —02 27 67 09 15 88 49 26 
22 25 10 41 44 18-07 9 20 36 04 М 86 
23 32 16 58 26 13 06 72 01 35 21 57 22 
24 40 44 2 26 291 36 50 22 34 44 22 34 
Variance 3.24 2.57 3.27 2.37 2.06 1.76 6.25 1.32 3.50 2.44 3.08 2.36 


hyperplane-count (20), even larger than the graphic (18). The 
varimax solution was also closest overall to the graphic, although the 
solutions with w = 2, 4, and 8, were almost as close and clearly in 
the same general position. The solutions with w = 24, 48, and 96 
were quite different, however, although somewhat similar among 
themselves. These latter solutions may well exhibit as clear a simple 
structure as those closer to the graphic, suggesting, perhaps, that 
closeness to a graphic position is not the only possible orthogonal 
position exemplifying a simple structure. 

Probably the most relevant basis of comparison lies in the com- 
parability of interpretation given the factors in the different solu- 
tions. The solutions presented in Tables 3 and 4 will be used for this 
purpose. The following tabulation gives the identifying number of 
the variables that would be used to provide an interpretation of the 
TABLE 4
Rotated Solutions, Using Values of 2 (= m/2, Equamax) and
24 (= 6m) in the Orthomax Criterion, for the Twenty-Four Psychological
Tests (Leading Decimal Points Omitted)


w = 2 w = 24
Factor Factor

I II III IV I II III IV
1 15 23 6 18 7 84 62 
2 и 0 42 1 14 17 40 
3 17 05 53 09 074 
4 2 nu 6 08 з 24 49 
5 75 23 18 13 72 87 08 
6 75 12 20 22 75 2 13 
7 62 18 17 08 79 85 06 
8 55 28 35 12 52 41 25 
9 80 03 19 26 CAAT 1:19. 
10 14 70 —10 22 03 62 —17 
11 17 62 04 34 09 64 00 
12 01 70 20 09 —0 68 10 
13 18 61 38 06 10 68 27 
14 22 18 01 49 24 10 06 
15 12 (09 12.51 и 02 19 
16 0 12 89 44 15 12 4 
17 14 20 03 64 M 1508 1l 
18 00 29 29 54 03 22 84 
19 13 17 22 89 16 15 25 
20 38 14 4 26 88 23 48 
21 16 40 39 2 14 43 $ 
22 $7 07 89 87 42 14 40 
23 $0 24 54 B 87 86 49 

34 46 18 33 $0 46 14 

3.58 2.67 2.78 2.36 3.52 3.04 2.40 2.4 


factors in each of the five solutions (that is, those variables with
loadings greater than .30, in absolute value):


Factor
Solution I II III IV
Graphic 5, 6, 7,8,9, 10,11,12,13, 123,48, 561791 
10, 11, 13, 28, мт, 18,21, 13, 16,18, 20, 15,16, 5 
j 21, 22, 23 19, 20 
Quartimax — 5,6,7,8,9 — 10,11,12,13 allbutl0and 14,15, М, 
г 14 | 
Varimax 5, 6,7, 8,9, 10,11, 12,13, 1, 2,3, 4, 8, 11, 14,1 
20, 22, 23, 21, 24 13, 16, 18, 20, 17, 18, 19) 
21, 22, 23 24 1 
w=2 56,789, 10,11,1213, 123 5, 
, ‚3,4,8, 11,1415 
(Equamax) 20, 22, 23,24 21,24 — ' 43 46/06, 91, 17, 12 9, 
id 22, 93 21 СЙ 
и = 24 pa 6, 7, 8,9, 157,810 1,2,3, 4, 16, 10,11,16 
, 22, 23, 24 11, 12, 13,21, 18, 20, 21,22, 16,17,18 
23 24 
The varimax, equamax, and w = 24 first factors would be interpreted
identically, but differently from those of the graphic and
quartimax solutions. The second factors of the varimax and equa- 
max (and probably the quartimax) solutions would be interpreted 
identically, but again quite differently from those of the graphic 
solution (which has, additionally, variables 14, 17, and 18) and the 
w = 24 solution (with variables 1, 5, 7, 8, and 23 additionally). The
same situation is true for Factors III and IV, with the quartimax 
factors quite different, largely because of the very unequal vari- 
ance dispersion (more than 50% on Factor III, and less than 12% 
on Factor IV). It is probably true, of course, that the differences in 
interpretation noted are to a large extent a function of the differences 
among the solutions in both the equalization and specific allotments 
of the total variance. Thus, variance dispersion and factor interpre- 
tation are to some degree two sides of the same coin. 


Wittenborn Data 


Four orthomax solutions were obtained for this data set, with w 
set to 0, 1, 3.5, and 7. Summary results appear in Table 5. It can be 
seen from this table that, again, the size of w was directly related to 
the degree of variance equalization. The graphic solution for these 
data brought about far less variance equalization relative to the 
orthomax solutions than have the previously presented graphic
solutions. As before, there was considerable variability over the
solutions in terms of which factors accounted for the most and
least variance, and so on.

With the possible exception of that of the quartimax solution, 
hyperplane-counts were very similar for the graphic and orthomax 
solutions. The varimax solution again was closest, in terms of
mean angular separations, to the graphic, followed by the quartimax,
equamax, and the w = 7 (= m) solutions. The fact that the
quartimax solution was closer, overall, to the graphic than were the
equamax and w = 7 solutions would appear to be further evidence
that closeness to a graphic solution may be a very imperfect index
of simple structure for orthogonal solutions, since it is unlikely that
the quartimax solution with the very unequal variance dispersion
represents a superior simple structure to the equamax and w = 7
solutions. As with the Twenty-Four Psychological Tests, the inter- 
TABLE 5

Dispersion of Variance, Hyperplane-Counts, and Correlations and Angular
Separations with the Graphic Solution for the Orthomax Solutions
of the Wittenborn Data

Variance Dispersion
                                  Factor
Solution       I       II      III     IV      V       VI      VII     Range
Graphic      1.265    .910    .815   2.641   1.236   1.510    .819    1.826
w = 0        1.303    .590    .919   3.291   1.222   1.082    .760    2.701
    1        1.427    .915    .959   2.254   1.320   1.426    .866    1.388
    3.5      1.351   1.102   1.031   1.655   1.401   1.558   1.068     .624
    7        1.308   1.112   1.082   1.485   1.452   1.574   1.154     .492

Hyperplane-Counts
                                  Factor
Solution       I       II      III     IV      V       VI      VII     Total
Graphic       10      10      12       3      11       8       7       61
w = 0         10       9      11       0      13      13      10       66
    1          8       7      12       5      10       9      11       62
    3.5        8       7      12       9      10       9       7       62
    7          8       7       9      11       8       9       6       58

Correlations with Graphic Factors
                              Graphic Factor
Solution       I       II      III     IV      V       VI      VII
w = 0        .9966   .7848   .9566   .9628   .9738   .9470   .7628
    1        .9873   .8818   .9588   .9344   .9759   .9614   .8590
    3.5      .9758   .7654   .9529   .8696   .9764   .9555   .8577
    7        .9715   .7340   .9471   .8452   .9766   .9491   .8488

Angular Separations with Graphic Factors
                              Graphic Factor
Solution       I       II      III     IV      V       VI      VII     Mean
w = 0        4°45'   38°17'  16°56'  15°41'  13°8'   18°44'  40°17'  21°7'
    1        9°9'    28°8'   16°30'  20°52'  12°36'  15°58'  30°47'  19°9'
    3.5     12°45'   40°3'   17°39'  29°35'  12°20'  17°10'  30°56'  22°57'
    7       13°43'   42°46'  18°43'  32°18'  12°25'  18°22'  31°55'  24°19'


pretation of a given factor, for these data, was somewhat dependent 
upon the particular solution in which it was found. 


PMA Data 


Six orthomax solutions were obtained for this well-known data 
set, with w set to -10, 0, 1, 6.5, 13, and 26. Summary results appear
in Table 6. As might be expected with a large number of factors, the 
matching of the thirteen factors obtained in each solution, with 
those of the graphic solution was fairly difficult, and was
accomplished, in several cases, only by strict adherence to the rule of
maximizing tr (K). 

As with previous data sets, the graphic solution of the PMA Data 
was obtained with less equalization of common-factor variance 
than were several of the orthomax solutions. Again also, almost a 
perfect inverse relationship can be seen between size of w and
variance equalization, using the range (largest variance allotment
minus smallest allotment) as the index of equalization. One could, of
course, use the variance or standard deviation of the variances as an 
alternative index. As before, the variance was dispersed not only 
more or less equitably as a function of w, but also differently. One 
can see, for example, that Factor X in the graphic solution had 2.66
units of variance associated with it—resulting in eight loadings 
large enough (greater than .30) to serve in interpreting the factor. 
The corresponding varimax factor, however, accounted for only 1.00 
unit of variance—resulting in only two loadings greater than .30. 
The corresponding equamax factor had roughly as much variance 
(2.60 units) associated with it as had the graphic, and consequently 
had nine salient loadings. The orthomax solutions with w set to 13 
and 26 had more variance associated with this factor (3.16 and 
3.09 units, respectively) than had the graphic and, consequently, 
would allow a broader interpretation of the factor—with 11 and 
13 significant loadings, respectively. Thus, with w set to 1, Factor 
X accounted for little variance (the least variance of the 13 varimax 
factors) and would be narrowly interpreted, whereas with w set to 
13, for example, Factor X accounted for more variance than eight 
of the remaining 12 factors and would be broadly interpreted. Con- 
versely, varimax Factor VI can be seen to account for more vari- 
ance (4.06 units) than nine of the remaining factors, whereas the 
corresponding factor with w set to 13 accounted for more of the 
variance (3.10 units) than only five of the remaining factors. 

As with the previous data sets, it appears true with the PMA 
Data that simple structure was probably equally well exemplified 
in the orthomax solutions with w= 1 or greater. Again, there is evi- 
dence that hyperplane-counts signify little with orthogonal solutions. 
Also, with as many factors as in the PMA Data and the restriction 
to orthogonality, it seems true that, rather than there existing one 
optimal position for the axes, to which the various analytic
functions transform, more or less well (as would appear to be true for
oblique solutions), there exist many possible positions that
exemplify simple structure equally, but not very, well. It is seen, for
example, from Table 6, that virtually none of the orthomax solutions
had axes in the same general position as the graphic solution, al- 
though the orthomax solution with w set to 13 was the closest. 

Since simple structure would appear to be somewhat of a constant 
—for w = 1 or greater—and quite likely a characteristic on which 
little if any choice can, for practical purposes, be made among sev- 
eral orthomax solutions, it may well be true that such a choice most 
logically should be made in terms of interpretability and specific 
interpretation of the factors among the several possible solutions. 
In Table 7, the graphic Factor XII and the factors from the orthomax
solutions with w set to 1, 6.5, 13, and 26 that were matched
with this factor are presented. The reader will recall that this
matching was accomplished by maximizing tr(K); obviously, from Table
6, the match involving this graphic factor was not very close for the
solutions with w = 1, 6.5, and 26, although the factor matched from
these solutions was, in each case, that which was closest to graphic
Factor XII. From Table 7, it can be seen that, if the factor is
interpreted in terms of the variables loading on it to the extent of .30 or
larger, in absolute value, the Factor XII from the orthomax solution
with w set to 13 is almost identical to the corresponding graphic
factor, the interpretation in either case being a Visualization factor.
The varimax Factor XII, however, is much more narrowly defined
(1.72 units of variance, as opposed to 3.24 for the w = 13 factor),
and has large loadings by only two of the variables that defined the
[...] Lozenges A (variable 16) and [...] tests. This varimax factor
may be somewhat [...] characterized as a Visual Memory factor since
its most [...] are the memory tests, Word Recognition (variable [...])
and Figure Recognition (variable 47). Each of the remaining two
orthomax solutions, with w set to 6.5 and 26, has a Factor XII that
would be both [...] interpreted than would [...] for example, a verbal
factor [...] but rather how meaningful an interpretation each
obtained [...]
TABLE 7
Factor XII of the Graphic and Orthomax PMA Solutions
(Decimal Points Omitted)
Solution
Variable Graphic w = 1 w = 6.5 w = 13 w = 26
1 01 04 01 10 87 
2 16 22 05 24 49 
3 07 —03 12 10 20 
4 00 07 15 12 21 
5 15 02 82 23 13 
6 -14 -14 09 —05 —03 
7 —11 06 02 00 07 
8 16 09 27 16 13 
9 24 29 18 22 08 
10 04 02 19 10 —01 
11 —13 12 —07 — 04 —02 
12 —01 10 07 —03 —21 
13 —06 12 04 02 06 
14 80 —04 58 84 19 
15 30 —05 84 26 13 
16 54 80 81 61 66 
17 87 06 23 88 88 
18 62 84 64 65 40 
19 80 4 17 80 30 
20 85 01 32 83 29 
21 62 29 54 68 61 
22 40 12 66 47 25 
23 19 —05 20 13 14 
24 15 -22 29 16 10 
25 26 12 48 29 10 
26 26 17 $8 28 19 
27 13 11 28 18 14 
28 —03 —05 04 —02 —01 
29 01 09 —10 07 15 
30 11 04 02 02 —03 
31 04 —16 01 —05 04 
32 1 02 21 11 12 
н 33 —04 04 07 06 15 
34 —03 03 18 09 11 
35 21 05 13 15 19 
36 13 —01 83 15 12 
37 14 21 17 80 88 
38 —02 04 24 16 28 
39 —02 03 06 10 20 
40 21 20 21 25 37 
41 24 13 46 83 21 
42 13 —13 18 15 27 
43 04 1 08 02 —03 
44 08 26 06 13 12 
45 —01 20 09 —10 —06 
46 34 50 15 87 25 
4T 27 59 07 27 19 
48 04 16 05 01 —09 
k 49 13 22 —05 20 85 
50 n 19 02 07 —02 
05 06 —07 08 15 
9 —04 1 —12 05 19 
53 —06 03 —06 00 02 
54 —09 07 —07 —01 04 
55 31 07 11 $4 56 
56 00 —08 03 01 05 
57 10 21 12 10 17 


Conclusions and Implications 


The following conclusions appear warranted from the results. 

1. In general, as w is increased, the variance dispersion among 
the factors tends to become increasingly more level (this possibility 
was first suggested by Saunders, 1962). Solutions with small values 
of w (for example, less than 1) have large first factors, precluding a
clear-cut simple structure. 

2. Because of conclusion (1), hyperplane-count is a poor index
of simple structure for orthogonal solutions, at least if one includes 
solutions with w very small, since these solutions yield large counts 
because of small variance allotments to factors other than the 
first. 

3. There is little evidence to suggest that one special case of the
orthomax criterion will, in general, yield solutions more closely
aligned with a graphic solution for the data than any other, or, for
that matter, to suggest that, for orthogonal solutions, this criterion
corresponds closely to exemplification of simple structure.

4. Simple structure would appear to be somewhat of a constant 
for orthogonal solutions with w set to 1 or larger. 

5. Interpretation of the factors (as with variance dispersion), 
however, can be expected to change substantially—partly because 
of the differences in (a) equalization of factor variance and (b) the 
variability in order of allotment of variance to the factors in a given 
solution—as w is varied. 

The implications appear clear. Since any orthomax solution
represents as mathematically legitimate an orthogonal transformation
as another, it would seem reasonable, if an orthogonal solution
is desired, to obtain several orthomax solutions with w varied
between 1 and 2m or even larger (values of w less than 1 do not appear
promising). If one can specify a priori (perhaps from the purposes
to which the factors are to be put) an optimal variance allotment 
(strict equalization is seldom optimal), the choice will be clear-cut. 
In the construction of a multi-factorial test, for example, the user 
could conceivably desire either a broader or narrower interpretation 
for a given factor than afforded by a single given solution. Barring 
the possibility of a preferable variance allotment, one obtained 
solution will undoubtedly have factors that are, in some sense, 
more interpretable, interesting, or in line with theory than those in 
the other solutions. Choosing this solution (which stops far short of 
a procrustean approach), then, would appear to be a less “blind” 
approach to orthogonal transformation than accepting the solution 
obtained by only one special case, for example, varimax, of this 
general criterion. 
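
The recommendation above, to obtain several orthomax solutions with w varied and then compare their variance allotments, can be sketched as follows. This is a minimal illustration and not the authors' program: it uses a common SVD-based iteration for the raw (unnormalized) orthomax criterion, and the loading matrix `A`, the function name `orthomax`, and the chosen values of w are assumptions made only for the example.

```python
import numpy as np

def orthomax(loadings, w=1.0, max_iter=500, tol=1e-10):
    """Rotate a p x m loading matrix orthogonally under the raw orthomax criterion.

    w = 0 corresponds to quartimax, w = 1 to varimax, w = m / 2 to equamax.
    Returns the rotated loadings and the orthogonal transformation matrix.
    """
    A = np.asarray(loadings, dtype=float)
    p, m = A.shape
    T = np.eye(m)
    previous = 0.0
    for _ in range(max_iter):
        L = A @ T
        # Gradient of the criterion; the w term penalizes unequal column sums of squares.
        G = A.T @ (L**3 - (w / p) * L @ np.diag((L**2).sum(axis=0)))
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt                      # orthogonal factor of the gradient matrix
        current = s.sum()
        if current - previous < tol * current:
            break
        previous = current
    return A @ T, T

# Hypothetical unrotated loadings (6 variables, 3 factors); not data from the paper.
A = np.array([
    [0.70,  0.30,  0.10],
    [0.65,  0.35, -0.15],
    [0.60, -0.40,  0.20],
    [0.55, -0.45, -0.10],
    [0.50,  0.10,  0.50],
    [0.45, -0.05, -0.55],
])
m = A.shape[1]

results = {}
for w in (1.0, m / 2, float(m), 2.0 * m):
    L, T = orthomax(A, w)
    results[w] = (L, T)
    # Column sums of squared loadings: the variance allotted to each factor.
    print(f"w = {w:3.1f}  factor variances:", np.round((L**2).sum(axis=0), 3))
```

Each solution is an orthogonal transformation of the same matrix, so the communalities are identical across values of w; only the allotment of variance to the factors changes, which is the quantity the user would inspect when choosing among solutions.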
A FACTOR ANALYSIS OF THE EPPS AND PRF
PERSONALITY INVENTORIES


ALLEN L. EDWARDS, ROBERT D. ABBOTT, AND ALAN J. KLOCKARS
University of Washington 


Two multi-scale personality inventories have followed Murray’s 
need structure theory of personality in the formulation of their 
scale definitions and item domains. Edwards (1957a), on the basis 
of the list of manifest needs presented by Murray and others (1938), 
developed the Edwards Personal Preference Schedule (EPPS) which 
measures the strength of 15 needs: Achievement (ach), Deference
(def), Order (ord), Exhibition (exh), Autonomy (aut), Affiliation
(aff), Intraception (int), Succorance (suc), Dominance (dom),
Abasement (aba), Nurturance (nur), Change (chg), Endurance
(end), Heterosexuality (het), and Aggression (agg).

The EPPS uses a forced-choice item format in which two state- 
ments with approximately the same social desirability scale values, 
but representing different needs, are paired and the subject’s task is 
to select the alternative which best describes him. Scores on the 15 
EPPS scales are ipsative, that is, the sum of the 15 scores for each 
subject is equal to a constant. 
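
The consequence of ipsative scoring for a correlational analysis can be illustrated with a short sketch. The data below are simulated, and the row total `TOTAL` is an arbitrary constant chosen for illustration; nothing here reproduces actual EPPS scores. When every subject's scores sum to the same constant, each row of the covariance matrix sums to zero, so the matrix is singular.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated raw preference strengths: 200 hypothetical subjects x 15 scales.
raw = rng.normal(size=(200, 15))

# Ipsatize: remove each subject's own mean, then add a common constant so that
# every subject's 15 scores sum to the same total.
TOTAL = 210.0  # arbitrary illustrative constant
ipsative = raw - raw.mean(axis=1, keepdims=True) + TOTAL / 15

row_sums = ipsative.sum(axis=1)
cov = np.cov(ipsative, rowvar=False)
smallest_eigenvalue = np.linalg.eigvalsh(cov)[0]

print("all row sums equal the constant:", np.allclose(row_sums, TOTAL))
print("rows of the covariance matrix sum to zero:", np.allclose(cov.sum(axis=1), 0.0))
print("smallest eigenvalue:", f"{smallest_eigenvalue:.2e}")
```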

The Personality Research Form (PRF), developed by Jackson
(1967), provides scores on 20 scales developed from Murray’s need
definitions. The scales included in the PRF, Form AA, are:
Abasement (Ab), Achievement (Ac), Affiliation (Af), Aggression (Ag),
Autonomy (Au), Change (Ch), Cognitive Structure (Cs), Defendence
(De), Dominance (Do), Endurance (En), Exhibition (Ex),
Harmavoidance (Ha), Impulsivity (Im), Nurturance (Nu), Order
(Or), Play (Pl), Sentience (Se), Social Recognition (Sr), Succorance
(Su), and Understanding (Un). In addition, the PRF includes
two stylistic scales: Infrequency (In) and Desirability (Dy).

In contrast to the EPPS, the items in the PRF are in a true-false 
format and scores on the scales are not ipsative. Despite these differ-
ences in item format and metric, many of the PRF scales have the
same trait names as those in the EPPS. Although the PRF Manual 
(Jackson, 1967) reports correlations between the PRF scales with 
scales in the Strong Vocational Interest Blank and in the California 
Psychological Inventory, the correlations of the PRF scales with 
those in the EPPS are not reported. The present study was under- 
taken to determine the degree to which corresponding PRF and 
EPPS scales are correlated and also the degree to which the scales 
have loadings on a common factor. 


Method 


The EPPS and Form AA of the PRF were administered under 
standard instructions to 109 male and 109 female students who par- 
ticipated in a test research project. In addition, scores on Edwards’ 
(1957b) Social Desirability (SD) scale, Welsh’s (1956) R scale and 
the Marlowe-Crowne (1960) MC scale were available for each sub- 
ject. These three scales have been found to be useful in identifying 
factors obtained when Minnesota Multiphasic Personality Inven- 
tory (MMPI) scales are intercorrelated and factor analyzed by the 
use of principal components (Edwards, Diers, and Walker).

The 40 scales were intercorrelated and factor analyzed by the
method of principal components. Eleven factors, accounting for 71
per cent of the total variance, were extracted and rotated using
Kaiser’s Varimax.
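
The extraction step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: the scores are simulated stand-ins for 218 subjects on 40 scales, and the function name `principal_components` is hypothetical. The sketch shows how unrotated principal-component loadings and the per cent of total variance for a chosen number of factors are obtained from a correlation matrix; the varimax rotation that followed in the study is not shown.

```python
import numpy as np

def principal_components(R, n_factors):
    """Return unrotated principal-component loadings and per cent of total variance."""
    eigvals, eigvecs = np.linalg.eigh(R)           # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale each eigenvector by the square root of its eigenvalue to get loadings.
    loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    # For a correlation matrix the total variance equals the number of variables.
    pct_total = 100.0 * eigvals[:n_factors].sum() / R.shape[0]
    return loadings, pct_total

rng = np.random.default_rng(1)
scores = rng.normal(size=(218, 40))                # simulated scores, not the study data
R = np.corrcoef(scores, rowvar=False)

loadings, pct_total = principal_components(R, n_factors=11)
print(loadings.shape, round(pct_total, 1))
```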


Results and Discussion 


The correlations of each of the PRF scales with each of the EPPS
scales are given in Table 1. Of interest is the correlation of a
PRF scale with an EPPS scale which has the same or similar trait
name. These correlations are shown in italics in Table 1. For
example, the correlation between the PRF Abasement (Ab) scale and the
EPPS Abasement (aba) scale is .40. All but two of the PRF scales
have their highest correlation with an EPPS scale with the same or
similar trait name. The two exceptions are the PRF scales
designed to measure Achievement (Ac) and Abasement (Ab). The
PRF Ac scale correlates .46 with the EPPS Endurance (end) scale
and only .25 with the EPPS Achievement (ach) scale. The PRF
Ab scale correlates .40 with the EPPS Abasement (aba) scale and
slightly higher, .42, with the EPPS Endurance (end) scale.

Of the 20 PRF trait scales, there are seven which have trait 
names for which there is no EPPS counterpart with a similar or 
identical trait name. However, three of these scales, Cognitive Struc- 
ture (Cs), Harmavoidance (Ha), and Impulsivity (Im) have cor- 
relations of .58, .44, and —.54, respectively, with the EPPS Order 
(ord) scale. Defendence (De) correlates .41 with the EPPS Aggres- 
sion (agg) scale, Play (Pl) correlates .44 with the EPPS Hetero- 
sexuality (het) scale, and Social Recognition (Sr) correlates —.40 
with the EPPS Autonomy scale. The remaining scale, Sentience 
(Se), has relatively low correlations of .25 with the EPPS Intra- 
ception (int) scale and .28 with the EPPS Change (chg) scale. 

Table 2 gives the rotated and denormalized factor loadings of the 
scales on the 11 factors. Table 3 is an abbreviated table which shows 
only those scales with absolute factor loadings of .40 or greater on 
each factor. With but two exceptions, PRF and EPPS scales de- 
signed to measure the same trait have relatively high loadings on a 
common factor. The two exceptions are the PRF and EPPS scales 
designed to measure Achievement and Abasement. An examination 
of Table 2 also shows that, except for Achievement and Abasement, 
PRF and EPPS scales designed to measure the same trait have 
similar patterns of factor loadings across the 11 factors. 

For each of the 11 factors, there is at least one EPPS and PRF 
scale which serve to identify the factor. In some cases an EPPS scale 
has a higher loading on the factor and in other cases a PRF scale 
has the higher loading. In no case do any of the seven PRF scales 
for which there is no EPPS scale with a similar trait name identify 
factors which are not marked by one of the other 13 PRF or 15 
EPPS scales. 
TABLE 2

Communalities and Rotated Denormalized Factor Loadings of the EPPS, PRF,
and Marker Scales on 11 Factors

                                      Factors

Scale        I    II   III    IV     V    VI   VII  VIII    IX     X    XI    h²

ach         01   -25   -24    21    28    07    48   -16   -05    05   -50    76
def        -39    01    16   -13   -00    06    07    08    71    00   -10    71
ord        -79   -21   -04   -11    03    15   -07    10    10   -14   -05    75
exh         12   -14   -01    09   -16    09   -09   -82   -04   -04   -05    77
aut         34   -60   -11   -05   -12   -06    00   -13   -35    06    14    67
aff         20    45    38   -10   -14    08   -07    17   -01    33   -01    56
int        -04   -02    07   -03   -09   -86    05    03   -05   -02    06    76
suc         03    55   -12   -10   -39    19    09    32   -09   -10    04    65
dom        -04    01   -02    86    12    04    07   -10    03   -10    01    78
aba         02    31   -01   -64    08    03    10   -16    25   -20    17    68
nur         25    50    35   -14   -14   -16   -06    35    15   -01    11    67
chg         25   -15    19   -08    00   -04   -01   -03   -03    77    07    74
end        -37   -31    01   -21    70    15    03    12    03   -04   -04    81
het         18    04   -07    08   -20    31   -50    04   -37   -14   -09    60
agg         18   -29   -51    31    02    02   -07   -01   -09   -31    15    60
Ab          27    20    29   -28    10   -03    01     1    58   -14    26    74
Ac          02   -03    00    22    82   -12    07    07    04    08    04    76
Af          07    67    29    17    04   -05   -36   -15   -05    00   -06    74
Ag          00   -04   -76    25    01    07   -21   -04   -12   -05    02    71
Au          30   -72   -08   -07    17   -13    08    10    -0    23    18    76
Ch          49   -12   -07    15    13   -16   -07    03    05    62    13    74
Cs         -85    13   -04    05    07    01    07   -05   -02   -09   -04    77
De         -10   -06   -69    16    07    00    05   -10   -36    00    11    68
Do         -08     0   -10    83    27   -05   -08   -20   -08   -06   -02    83
En         -03   -04    13    20    83   -13    06    03   -03    04    00    77
Ex          23    25   -21    44    09    05   -20   -51   -22    03   -03    72
Ha         -54    29   -11   -09   -31    18    30    12    09   -01   -03    63
Im          75    10   -13   -06   -08    02   -21   -11   -14    17          77
Nu          18    61    33    02    17   -29   -11    06    08    11    12    67
Or         -79    19    05    14    06   -10   -02   -07   -03   -04    09    71
Pl          24    23   -01    00    -0    27   -64   -30   -16    17    00    77
Se          21    12    07    -0    32   -48   -24    02   -18    35   -14    63
Sr         -97    52   -33   921   -10    12   -01   -41    13    08   -25    77
Su          См    73   -17   -19   -23    09   -03    00   -07   -11    09    69
Un          14   -14    01    17    25   -69    11    10    11    08   -13    66
In          07   -10   -11   -05    05    06    05    03   -02    09    76    63
Dy         -11    14    59    41    24   -18   -07    03   -13   -07   -22    71
SD         -08   -06    65    40    15   -10   -15    00   -17    01   -12    70
R           04   -18    34   -27    17    17    60    17   -27   -12    05    76
MC         -14    01    66    07    16    05    04   -05    09    07    32     6

% Total  10.56 10.31  9.03  7.92  7.31  5.15  4.51  4.38  4.20  4.11  3.48
% Common 14.88 14.53 12.73 11.16 10.31  7.25  6.35  6.18  5.92  5.79  4.90
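
The two percentage rows at the foot of Table 2 are related by simple arithmetic, which the sketch below checks. It assumes that "% Common" is each factor's share of the variance retained by the 11 factors, and it takes the third "% Common" entry as 12.73; both are readings of the printed table rather than statements made in the text. The variable names are illustrative only.

```python
# Values as printed in the "% Total" and "% Common" rows of Table 2.
pct_total = [10.56, 10.31, 9.03, 7.92, 7.31, 5.15, 4.51, 4.38, 4.20, 4.11, 3.48]
pct_common_printed = [14.88, 14.53, 12.73, 11.16, 10.31, 7.25, 6.35, 6.18, 5.92, 5.79, 4.90]

# Variance retained by the 11 factors, as a per cent of total variance.
retained = sum(pct_total)

# Each factor's share of the retained (common) variance.
pct_common = [100.0 * value / retained for value in pct_total]

print(f"retained: {retained:.2f} per cent of total variance")
for printed, computed in zip(pct_common_printed, pct_common):
    print(f"printed {printed:5.2f}   recomputed {computed:5.2f}")
```

The retained total agrees with the 71 per cent reported in the Method section, and the recomputed shares agree with the printed row to within rounding of the published two-decimal figures.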
The results of this study show that the PRF and the EPPS scales
share considerable common variance, despite the fact that the EPPS 
items are in a forced-choice format and the PRF items are in a true- 
false format and that the EPPS scales are ipsative whereas the 
PRF are not. Either one of the two inventories would appear to 
provide reasonable measures of most of the 11 factors obtained in 
this study. 
THE USE OF HIERARCHICAL FACTOR ANALYSIS IN THE
DETERMINATION OF CORPORATE IMAGE DIMENSIONS 


DARRELL E. ROACH 
Nationwide Insurance 
ROBERT J. WHERRY, SR. 
Ohio State University 


The customer of companies which market intangible products
or service has a very limited opportunity to evaluate the various
product attributes. Consequently, such firms often rely on image 
building to establish and maintain an identity among their cus- 
tomers and prospective customers. 

The use of factor analysis and cluster analysis in the determina- 
tion of the basic dimensions has been demonstrated by Cohen 
(1963) and Spector (1964). The present study demonstrates the use 
of hierarchical factor analysis in the determination of the corporate 
image dimensions for multi-line insurance companies. 

A committee consisting of representatives from the public relations,
marketing, advertising, and research functions of such a company
was established. The committee developed 139 statements
descriptive of things (a) which the company or its representatives
do or do not do and (b) which were felt possibly to contribute
to the company image.

These statements were organized into a checklist questionnaire
on which a respondent would evaluate each question on a five point
scale as to whether such an activity would be very bad (rated 1)
up to very good (rated 5) for a multi-line insurance company to
engage in. This questionnaire was administered to 500 male heads
of households in fourteen eastern states by ARB Surveys, New
York. Editing and coding yielded 472 usable questionnaires for
analysis.

Due to the large number of items, the responses were analyzed
by the Wherry-Winer method (Wherry and Winer, 1953) which 
requires pre-sorting of the items into trial factors. Eighteen such 
sets were reliably distinguished by a group of judges. In view of the 
rather large number of potential factors, it was decided that the 
primary factors should be further subjected to a Wherry hierarchi- 
cal rotation (Wherry, 1959). This procedure establishes simple 
structure in the primary factors by the addition of higher order 
factors representing broader abstractions underlying the responses. 
This technique has previously been found to aid in interpreting 
other multidimensional psychological areas such as morale (Wherry, 
1958). 

All of the originally proposed 18 factors were substantiated by 
the actual factor analysis. In addition, four second order factors and 
two third order factors were added by the hierarchical analysis. 
A FORTRAN IV program for carrying out the combined procedure
is available from the Psychology Department, Ohio State University.


The Hierarchical Structure 


Factors were named on the basis of the items which loaded 
significantly on them (the actual items will be shown in the next 
section). Lower order factors were located in the hierarchy on the 
basis of their loadings, i.e., the loadings of their significantly loaded
items, on the higher order factors. As each factor was considered,
its potential popularity was estimated by averaging the mean scale 
values of its highest loaded items (usually five items or in the case 
of higher order factors 10 items). The final hierarchical structure, 
together with the popularity values, is given in Figure 1. 
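
The popularity estimate described above can be sketched in a few lines. The function name `popularity` and the numbers in the usage example are hypothetical; the sketch simply averages the mean scale values of the items with the highest loadings on a factor (five items for a primary factor, ten for a higher order factor, per the text).

```python
import numpy as np

def popularity(loadings, mean_ratings, n_top=5):
    """Average the mean ratings of the n_top items with the highest loadings."""
    loadings = np.asarray(loadings, dtype=float)
    mean_ratings = np.asarray(mean_ratings, dtype=float)
    top = np.argsort(loadings)[::-1][:n_top]   # indices of the highest-loaded items
    return float(mean_ratings[top].mean())

# Hypothetical loadings of four items on one factor and their mean ratings (1 to 5).
example_loadings = [0.60, 0.50, 0.10, 0.40]
example_ratings = [4.0, 5.0, 1.0, 3.0]

value = popularity(example_loadings, example_ratings, n_top=3)
print(value)
```

With these made-up numbers the three highest-loaded items have ratings 4, 5, and 3, so the estimate is their mean.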

All items from the lower order factors loaded significantly on 
one or both of the two third order sub-general factors: 


I. Prestige through Quality Services or Products 
II. Prestige through Self-serving Manipulation. 


The average popularity value of the 10 items with highest loadings 
on Factor I was a very favorable 4.11, while the corresponding 
average for Factor II was only 2.90 (3 = neutral). Thus, an image 
based primarily on the Quality of Services and Products would 
be a very favorable one, while an image of engaging in sharp 
Self-serving Manipulative Practices would lead to apathy or even
mild disapproval. 

Factors contributing to an image based upon quality products
and services included:

    competent agents, product quality control, alertness to new ideas,
    national scope, policyholder involvement, social responsibility

and to a lesser degree certain aspects of the

    marketing program, sound investments, and community involvement.

Social Responsibility was in turn based upon:

    agent contact, special services, and concern for employee welfare.

The Marketing Program was evaluated in terms of:

    good claims service, low rates, TV and radio sponsorship, and
    use of mass marketing.

Community Involvement had two subfactors:

    agents' role in community affairs, and company financial
    involvement in the community.

Factors contributing to the Image of Self-Serving Manipulations
were:

    Competitive Strategies

and certain aspects of:

    marketing program, sound investments, and community involvement.

Competitive Strategies in turn had three components:

    international involvement, political involvement, and cut-throat
    competition.


From the above we can conclude that a given example of com- 
pany or agent behavior may influence several levels of attitude 
toward the company. Money spent to advertise certain behavior to 
help build the image of the company will probably do so, but 
whether that image will be good or bad will depend upon how 
heads of households judge that behavior. The reader or hearer 
will surely respond favorably if he believes the behavior will 
result in his getting better service, a better product, or that the 
company has his welfare in mind. On the other hand he will respond
unfavorably if he believes the activity will decrease his personal
contact with the company or if the behavior indicates that the
company believes that it cannot make it solely on the basis of
its services and product and hence is seeking to eliminate
competition or to gain political favors in order to survive.

In the section to follow we will present the items which loaded 
significantly on each of the original 18 primary factors, and in 
addition will give their loadings on higher order factors under 
which they were classified in Figure 1.¹


The Primary Image Factors and Their Loadings 


Factor A.I. Competent Agency Staff. Items on this factor were
loaded significantly only on their own primary and on Factor I.
The items and their loadings were:


Item                                                           I    A.I

Agents consider themselves counselors rather than salesmen    .56    Al
Agents are well qualified to sell insurance                   .63   382
Agents tailor policy to fit personal needs                    .58   .82
Agents are familiar with all kinds of insurance               .58   .32
Agents really know insurance                                  .61   .31
Agents try to provide right protection rather than just
  to sell policy                                              .58   .31

The image is clearly one of well trained, professionally and 
client-centered agents. Favorability of this factor was a high 4.15. 

Factor B.I. Product Quality Control. Items followed the same
pattern of loadings as did those in A.I. The significant items were:


Item                                                           I    B.I

Never changes product until a better one has been proven      .46   .39
Emphasizes quality of insurance it sells                      .58   .38
Doesn't come out with new type of policy until
  thoroughly tested                                           .57   .95
Most outstanding characteristic is reliability                .62   .33
An old established company                                    304... 009


The image is one of an old, established, reliable company which 
carefully guards against any possible lowering of the quality of 
its product. The average favorability rating was 4.07. 


¹ A complete listing of all items with all loadings is available from Dr.
D. E. Roach, Nationwide Insurance.
Factor C.I. Alertness to New Ideas. Again these items show only
the two loadings. Actual items and loadings were:


Item                                                           I    C.I

Constantly seeks new services to offer                        .62   .42
Has progressive ideas                                         .61   .39
Keeps up to date in all developments of the industry          .65   .33
Ever alert for new ideas                                      .69   .82


Here the image is one of a progressive firm which keeps its 
services up to date. The care in testing described in B.I. above 
did not prevent new services, but only assured their high quality. 
The average favorability rating was 4.07. 

Factor D.I. National Scope. Again the factor pattern is identical 
to that above. The significant items were: 


Item                                                           I    D.I

Has branch offices in every state                             .49   .48
Has agents selling insurance in every state                   .51   .46
Has large regional offices all over the country               .50   .45
Is continually expanding                                      .46   .48
Doubled volume of business last year                          .50   .84


The image here is one of a large and growing company with 
enough facilities to make it easily contactable by any policy- 
holder wherever he may be. The average favorability index was 
3.88. 

Factor E.I. Policyholder Involvement. The item pattern again
shows only the two sets of loadings. Actual item loadings were:


Item                                                           I    E.I

Encourages suggestions from policyholders                     .57   .48
Has put many policyholder suggestions into effect             .55   .48
Has a formal program to stimulate suggestions from
  policyholders                                               .58   .45
Is guided by reactions of policyholders                       .44   .36
Keeps policyholders informed about company and
  insurance business                                          .58   .32


The image is one of effective company-policyholder communication.
Information is given, reactions and suggestions are collected
and then acted upon. The average favorability rating was 3.79.


Factor 1F.I. Agents Contacts (Social Responsibility). This is the
first of three primary factors which contribute to the second order
factor of Social Responsibility. The items and their loadings were:


Item                                                     I    F.I   1F.I

Agents call on policyholders periodically               .49   .12   .45
Policyholders are well acquainted with agents           .54   .24   .36
Agents are conscientious about providing services       .62   .13   .28
Agents will go out of way to help policyholders         .62   .21   .26
Agents do little extra favors for policyholders         .48   .28   .26


The image painted is that of an agent who is a well known and
obliging friend, who seeks the policyholder’s best interest. The
favorability index was 4.01.

Factor 2F.I. Special Services (Social Responsibility). This is the
second of the three factors mentioned above. Its significant items
were:


Item                                                     I    F.I   2F.I

Company spends large sums of money to provide
  services for customers                                      .26   .44
Company gives sympathetic help when needed even
  at considerable expense                               .33         .41
Provides many services other than insurance,
  e.g., free home inspection                            .33         .32
Services have priority over profit                      .50         .28
Very conscientious about services to customers                      .21


This image says that the company itself, as contrasted to the
agent in the previous factor, shows a friendly concern for the
welfare of its policyholders. The average favorability was 3.79.

Factor 3F.I. Concern for Employee Welfare (Social Responsibility).
This is the third and last of the Social Responsibility
sub-subfactors. Its loadings were:


Item                                                     I    F.I   3F.I

President knows all employees by first name              Jl   .27   .56
Executives do personal favors for employees              di  1.27    AD
Provides free health services for employees             .32   .28   .48
Sympathetic about personal worries of employees         .45   .24   .36
Executives look out for welfare of employees            .55   .24   .29


The image is that of a company and its executives showing
concern for its employees similar to that shown by the agents to
policyholders in Factor 2F.I. The average favorability rating is
3.59. It is possible that the goodwill earned by such practices is
due either to a belief that (a) the company helps others,
therefore it may help me, or (b) the company is helping my friends,
the agents.

Factor 1.FG.I. Good Claims Service (Marketing Program) This 
factor is one of two which load both on FI Social Responsibility 
and GI Marketing Program and upon I. Prestige based upon 
Quality, Product and Service. Its items and loadings were: 


Item                                               I    F.I   G.I   1FG.I

Rarely rejects a claim made                        28б 21. 47
Makes liberal claim’s settlements                 .37   .25   .19   .45
Has claimsmen on duty 24 hours per day            .45   .25   .20   .38
Never takes advantage of fine print                29. 30 2887
Settles most claims within 48 hours                47 25 5 в


The company high on this factor settles claims fairly, generously, 
and promptly. The average favorability rating was 3.97. 
Factor 2.FG.I. Low Rates (Marketing Program) The pattern of
loadings is similar to that of Factor 1.FG.I above. Actual items 
and loadings were: 


Item                                               I    F.I   G.I   2FG.I

Has lowest rates in the industry                   ҮНӨ КР 86) 45
Rarely raises rates on present policyholders       Mb. .22 18 41
Rarely raises its rates                           .39   .13   .29   .39
Charges lowest possible rates for services
  rendered                                        .47   .27   .28   .97
Keep rates down by narrow profit margin           .43   .17   .36   .35


The image is one of a company which really tries to keep rates 
as low as possible. The average favorability rating was 3.96. 

Factor 3.G.I-II. TV and Radio Sponsorship (Marketing Program).
The third of the factors on the G. Marketing Program
sub-group, has loadings on both sub-general I and sub-general II.
Manipulative Practices. Its items and loadings were:


Item                                               I    II    G.I   3G.I-II

Sponsors plays on TV                               ВЕ ВОЗ 67
Sponsors sport programs/events on TV               73 .21 .65
Sponsors radio programs                            .97 . .98 416 .61
Sponsors programs on TV                            GL MIS ош .60
Sponsors adventure shows on TV                    .30   .40   .23   .59


The image is a mixed one of a company providing a service
for its policyholders and the public but which also exacts a price
for the service, its advertising program. The average favorability
rating was 3.55.

Factor 4.G.I-II. Mass Marketing (Marketing Program). The
last of the four primary factors falling under the G.I. Marketing
Program factor, has loadings on higher order factors G.I. and II,
but its loadings on I have become insignificant. The items and
loadings were:


Item                                               I    II    G.I   4G.I-II

Sells insurance by vending machines               .04   .37   .29   .67
Sells insurance by mail                           .09   .37   .31     6
Sells insurance in department stores              .11   .37   .29   .61
Factor 1H.II. International Involvement (Competitive Strategy).
This is the first of 3 primary factors which are grouped under
H.II, Competitive Strategies, the only factor whose items all fall
solely on the II sub-general. Its items and loadings were:


Item                                                    II    H.II  1H.II

Backs aid to foreign countries                          .45   .26   .50
Attempts to aid in solving international problems       .52   .32   .48
Uses some of its resources to attempt to improve
  international relations                               .41   .22   .46
Takes an active role in international affairs           .51   .29   .46
Invests money in foreign countries                      .49   .35   .45


The image is that of a firm which is active in international
affairs and spends its own money and urges government aid for
foreign countries. Apparently the policyholders, actual and
potential, see this as a self-serving action designed solely to increase
profits. Its average popularity value was only 2.98.

Factor 2H.II. Political Involvement (Competitive Strategy).
This second of the Competitive Strategy factors had the same
factor pattern as the preceding factor. Its items and loadings were:


Item                                                    II    H.II  2H.II

Encourages employees to take an active part in
  local politics                                        .56   .08   .44
Many company executives participate in political
  affairs                                               .55   .13   .48
Takes a definite position on political issues           .62   .29     *
Officers allow name to be listed as a supporter
  of a political party                                  .57   .26
Lobbies in state and national capitals to
  influence legislation                                 .58   .35


41 


40 


.32 


The image is one of a company which (a) takes its role as a
citizen seriously but (b) tries to promote its own interests in the
process. The average popularity rating of the high items was a
rather lukewarm or even slightly chilly 2.78.


Factor 3H.II. Cut-throat Competition (Competitive Strategy).
This is the last of the competitive strategy factors. Its high items
and their loadings were:


Item                                                    II    H.II  3H.II

Company capitalizes on conditions in times of
  distress                                              .36   .41   .41
Will absorb a competitor if it can                      .25   .44
Places profits as a top objective                        S3   .41
Does best for itself regardless of who it hurts         .94   .41
Publicly announces rate changes only when they
  are decreased                                         .45   .44


The image of blatant self-seeking, at the expense of others,
received the lowest average popularity rating of only 2.56.


Factor I.I-II. Sound Investment Policy. This factor is related to
both sub-general factors. Its high loadings were on:


Item                                                     I    II    I.I-II

Invests in many different enterprises                   .34   .27   .54
Makes only safe investments with low rate of return     .45   .23   .50
Invests money so as to bring maximum return for
  policyholders' dollar                                 .59   .12   .39

This image of a prudent investor playing a minimum risk
game is seen both as sound business practice for the company and
as benefit of the policyholder. The higher loading on I as opposed
to II was reflected in the average popularity rating of 3.79.




Factor 1J.I-II. Agent Role in Community Affairs (Community
Involvement). This factor is the first of two primary factors falling
under the higher order factor J.I-II dealing with community
involvement. Its high items and loadings were:


Item                                               I    II    J.I-II  1J.I-II

Agents belong to many community organizations      421.271 24 44
Agents play a large part in community affairs     .46   .34   .22     .40
Agents participate in community affairs even
  though they lose valuable worktime              .28   .42   .21     .39
Executives are urged to participate in
  community affairs                               .50   .28   .33     .91
Company exerts pressure to clean up slums in
  community                                       .41   .26    a5     .31


This image of agent and executive concern for the welfare of 
the community is viewed more as a service function than as a 
manipulative device. The average popularity rating of its items 
was 3.85. 

Factor 2J.I-II. Financial Involvement (Community Involvement).
The pattern of loadings is like that of the preceding factor.
Its high loadings were:


Item                                               I    II    J.I-II  2J.I-II

Agents can be counted on to help with
  fund-raising drives                             .41   .34   .28     .52
Company supports community chest and
  fund-raising drives                             .48   .26   .30     .49
Company invests money to stimulate the growth
  of the economy                                  .58    Û)   .19     .43


The image of a paternalistic interest in the community and its
economic well-being is also very well received. The average
popularity rating of the items was 3.61.


Discussion and Summary 


The above 18 primary factors speak for themselves. The panel
which constructed the questionnaire was on the whole quite
successful in selecting activities for rating which would arouse either
favorable or unfavorable images in the average head of a
household. Only one of the 18 factors (1H.II International
Involvement) received a nearly neutral rating (2.98).

The six most popular factors were: 


Competent agents (4.15) 

Product quality control (4.10) 
Alertness to new ideas (4.07) 

High agent-policyholder contacts (4.01) 
Rapid, generous claims service (3.97) 
Low rates (3.96) 


These factors all contributed quite highly to the sub-general factor 
of Prestige through Quality Services and Product. 

Only three types of activity were viewed unfavorably by the 
respondents: 


Political Involvement (2.78) 
Mass Marketing (2.61) 
Cutthroat Competition (2.56) 


These factors contributed heavily to the less favorable sub- 
general which we have labeled Prestige through Self-Serving Man- 
ipulation. 

We conclude that our obtaining of the higher order factors
through hierarchical rotation has added substantially to the
meaningfulness of the original 18 factors. The intermediate level factors
served to unearth the perceptual rubrics of:

Social Responsibility (including 5 primary factors)
Marketing Program (including 4 primary factors)
Competitive Strategies (including 3 primary factors)
Community Involvement (including 2 primary factors)


These concepts point to the interaction of their component elements 
in the process of image building. Finally our two sub-general factors
provide the general themes: 


Quality Services and Products 
Self-seeking Manipulation 


around which good and bad overall images are built for large 
multi-line insurance companies. 

It is highly probable that these same concepts (with appropri- 
ately changed items) would be found to apply to other predomin- 
antly service rendering companies. 
A COMPARISON OF ANALYSES USING THE FIRST AND
SECOND GENERATION LITTLE JIFFY'S


GEORGE P. HOLLENBECK
The Psychological Corporation¹


Principal components analysis and varimax rotation of all
components with eigenvalues greater than one (the Little Jiffy) has
been perhaps the most widely used factor analytic procedure
(Cronbach, 1970). It has served as a “workhorse,” especially in the
beginning exploration of an unknown domain. Recently, Kaiser (1970)
has proposed a second generation Little Jiffy (LJ-II) which he
suggested has several advantages over LJ-I. LJ-II provides a
measure of sampling adequacy of the variables under study; it uses
"model-free" Harris factors; it, at least theoretically, avoids
artifactual difficulty factors; it uses the "orthoblique" method for
transforming factors, thus allowing correlated factors.
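
The component-retention step of the first generation Little Jiffy, as described above, can be sketched briefly. The correlation matrix `R` is a made-up four-variable example and the function name `little_jiffy_components` is hypothetical; the sketch keeps every principal component whose eigenvalue exceeds one. The varimax rotation that completes LJ-I, and all of the LJ-II machinery, are outside this sketch.

```python
import numpy as np

def little_jiffy_components(R):
    """Return loadings of the principal components with eigenvalues greater than one."""
    eigvals, eigvecs = np.linalg.eigh(R)           # ascending order
    order = np.argsort(eigvals)[::-1]              # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0                           # eigenvalues-greater-than-one rule
    loadings = eigvecs[:, keep] * np.sqrt(eigvals[keep])
    return loadings, eigvals

# Hypothetical correlation matrix: two uncorrelated pairs of correlated variables.
R = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

loadings, eigvals = little_jiffy_components(R)
print("eigenvalues:", np.round(eigvals, 2))
print("components retained:", loadings.shape[1])
```

For this block-structured example the eigenvalues are 1.6, 1.5, 0.5, and 0.4, so two components are retained.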

The practitioner schooled in LJ-I is reluctant, however, to give
up that tool without some notion of how his obtained “order from 
chaos” would differ were he to use LJ-II rather than the old 
reliable LJ-I. Already uncomfortable with the fact that different 
factoring and rotational methods may provide different “orders” 
for the data, he is a little uncertain about using LJ-II. 


The present study examines a set of data with both LJ-I and 
LJ-II, comparing the results obtained using both methods. Although 
Kaiser has recently recognized that LJ-II is consistent in slightly 
underfactoring (mimeographed letter, February 1971), it was felt 
that the techniques represented in LJ-II are sufficiently interesting 
to make such a comparison worthwhile. Even though a comparison 
of LJ-I and LJ-II results from a well-defined, frequently-factored 
set of data might shed greater light on some of Kaiser’s 
reservations about LJ-II, a comparison of the two techniques on 
a “real-life” problem may highlight some of the differences and 
similarities in a less cut-to-order situation. 


1 The research reported here was conducted while the author was a 
J. McKeen Cattell Fund Postdoctoral Fellow at the University of 
California, Berkeley. Thanks are due Professor William Meredith for 
his helpful comments on a draft of this paper. 


Method 


The data analyzed were the test results from the normative sample 
of 142 five and five and one-half year olds collected for the stand- 
ardization edition of the individually administered McCarthy Scales 
of Children’s Abilities (in press). The 24 tasks making up the 
Scales are described in detail in the test manual. The tasks include 
verbal, numerical, perceptual-motor and gross motor tasks designed 
to measure a broad range of abilities in two and one-half to eight 
and one-half year old children. The task names (Table 1) are 
descriptive of the tasks themselves. The sample was stratified to 
represent, approximately, current population estimates of age, sex, 
color, geographical region, and father’s occupation. 

Product-moment correlations among the tasks were analyzed with 
both LJ-I and LJ-II. To compare the similarity among the factors 
derived from both methods, Tucker’s coefficients of congruence 
(Harman, 1967) among the loadings of the factor pattern matrices 
for the two analyses were computed. Since the Scales were designed 
for measurement in the two and one-half to eight and one-half year 
age range, the tasks varied widely in difficulty for the five to five 
and one-half year age group. Difficulty factors were examined 
through correlating the difficulty of each task (p, the mean per cent 
of possible score obtained in the total sample) with the rotated 
LJ-I and LJ-II loadings, and the unrotated LJ-I loadings. 
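
The coefficient of congruence used to compare factors can be illustrated with a brief computational sketch. It assumes the usual definition of the coefficient (the sum of cross-products of two columns of loadings divided by the square root of the product of their sums of squares). The function names and the example loadings are hypothetical and are not taken from the analyses reported here.

```python
import math


def congruence(loadings_a, loadings_b):
    """Coefficient of congruence between two columns of factor loadings."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("both factors must be defined on the same variables")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross / (norm_a * norm_b)


def congruence_table(pattern_a, pattern_b):
    """Coefficients for every pair of factors.

    Each pattern is a list of rows (one row per variable). The result has
    one row per factor of pattern_b and one column per factor of pattern_a.
    """
    columns_a = list(zip(*pattern_a))
    columns_b = list(zip(*pattern_b))
    return [[congruence(a, b) for a in columns_a] for b in columns_b]


# Hypothetical loadings for two factors defined on four variables.
factor_x = [0.7, 0.6, 0.1, 0.0]
factor_y = [0.6, 0.7, 0.0, 0.1]
print(round(congruence(factor_x, factor_y), 2))
```

Because the coefficient depends only on the pattern of the loadings, reflecting a factor changes the sign of the coefficient but not its size, which is why large negative values also indicate similar factors.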


Results and Discussion 


Little Jiffy II provides several types of information not provided 
in LJ-I analyses. One of the most interesting is the measure of 
sampling adequacy (MSA), indicating the extent to which the 
domain under study is adequately sampled by the variables. The 
overall MSA for the data analyzed was .82, indicating “good” data 
according to Kaiser's tentative guidelines. Additional analyses not 
reported here verified that the data tend to be robust to changes 
in factor models, communality estimates, or rotation procedures. 
Kaiser presented no evaluation benchmarks for MSA’s for individual 
variables. These MSA’s are presented in Table 1, along with 
the task means, SD’s, and task difficulties. The low individual 
task MSA’s (“low” taken arbitrarily to be less than .60) make 
“sense” in terms of the data and subsequent analyses: Tasks 1, 4, 
and 16 were easy tasks and have associated low SD’s; Task 4, 
with the lowest MSA (−.346), had essentially no variance. Tasks 
13 and 14 appeared as a relatively cohesive independent factor 
across several analyses. 


TABLE 1 
Means, Standard Deviations and Measures of Sampling Adequacy of 24 Tasks 

                                                         Percent 
Tasks                                            Mean    Maximum    SD     MSA 
 1. Block Building                               9.55     .95       .96    .55 
 2. Puzzle Solving                              11.21     .42      6.38    .84 
 3. Pictorial Memory                             6.89     .57      1.71    .49 
 4. Word Knowledge I—Picture Vocabulary          8.98     .99       .15   -.35 
 5. Word Knowledge II—Oral Vocabulary           11.44     .44      4.13    .81 
 6. Counting and Sorting I—Blocks                7.00     .78      1.74    .89 
 7. Counting and Sorting II—Dot Cards            3.99     .44      1.88    .91 
 8. Musical Memory                               4.26     .43      1.83    .88 
 9. Verbal Memory I—Words and Sentences         23.08     [?]      [?]     [?] 
10. Verbal Memory II—Story                       6.02     .55      [?]     [?] 
11. Right-Left Orientation                       6.66     .56      [?]     [?] 
12. Leg Coordination                            11.42     .88      [?]     [?] 
13. Arm Coordination I—[?]                       2.17     .31      1.35    .45 
14. Arm Coordination II—Beanbag Catch Game       3.56     .40      2.25    .55 
15. Arm Coordination III—Beanbag Target Game     3.22     [?]      [?]     [?] 
16. Imitative Action                             4.55     [?]      [?]     [?] 
17. Drawing I—Draw-A-Design                      8.39     [?]      [?]     [?] 
18. Drawing II—DAC                              10.07     [?]      [?]     [?] 
19. [?]                                          5.94     .50      1.73    [?] 
20. [?]                                          1.48     .18      [?]     .89 
21. Verbal Fluency                              14.59     [?]      [?]     [?] 
22. Number Questions                             [?]      [?]      [?]     [?] 
23. Opposites Analogies                          [?]      [?]      [?]     .93 
24. [?]                                          [?]      [?]      [?]     [?] 


2 Thanks are due Dr. Alan S. Kaufman of The Psychological Corporation 
for making these data available. 

LJ-I indicated seven factors, while LJ-II indicated six. Although 
this finding conforms to Kaiser’s subsequent conclusion that LJ-II 
slightly underfactors, additional analyses and inspection of the 
data seemed to indicate that the psychologically most meaningful 
number of factors for these data is probably four or five. For these 
four or five factors, the results from both analyses were quite 
similar. 

The factor pattern matrices for LJ-I and LJ-II are presented 
in Tables 2 and 3, respectively. Loadings ≥ .40 are presented in 
italics. 

Table 4 presents the coefficients of congruence between the rotated 
factor loadings for LJ-I and the loadings for LJ-II factors. The 
highest coefficient of each LJ-I factor with the LJ-II factors is 
presented in italics. All of the LJ-I factors except one (Factor VI) 


TABLE 2 
Little Jiffy I Factor Pattern Matrix 

                        Factors 
Tasks     I    II   III    IV     V    VI   VII 
  1      37    17   -29   -10    32    61    10 
  2      64   -02    02    00    14    10    10 
  3      12   -18    03   -09    81   [?]   -05 
  4     -02    05    13    05    05   [?]   -09 
  5      40   [?]    09    00    22    15    05 
  6      60   -12   [?]   [?]    00    16    05 
  7     [?]   [?]   -08   [?]    00    08   -02 
  8     [?]   [?]   [?]   [?]   [?]   -04    19 
  9      20   -71   -03    08   [?]   [?]    21 
 10     [?]   -64   -17    17    04    32    25 
 11      06   -21    18   [?]   [?]   [?]   -10 
 12      63   -09   -03    05   -14   -04    60 
 13     [?]   [?]    23   -24    07   -07    71 
 14     [?]   -04    57   -30    37    01    29 
 15      12    03    83    14   -10    10    07 
 16      04    13   -02   [?]    04   -01    08 
 17      70   -16    02   -12    22   -01    04 
 18      67   -10    13    00    05   -07   -02 
 19      09   -72    03   -26    02   -25   -06 
 20      59   [?]    14   -33   -04   -04    17 
 21      51   -33    07    09   -01    03    14 
 22      45   -40    09   -38    21    00   -14 
 23      68   -48   [?]   -05    04   -09   -20 
 24      71   -25    12   -10   -06    04   -01 

Note.—Decimal points have been omitted. 


TABLE 3 
Little Jiffy II Factor Pattern Matrix 

                     Factors 
Tasks     I    II   III    IV     V    VI 
  1     -04   -02    10    53   -01    03 
  2      01    05    17    12   [?]    68 
  3      15    26   -05    63    02   -00 
  4      01   -12   -13    32   -24    14 
  5      06    47   -02    26   -00    54 
  6     -10   -02    00    08    47    22 
  7     -17   -01    01    00    07    74 
  8      07    06   -05   -01    69   -03 
  9      02    92    63    00    21    09 
 10     -04    92    69    34   -04   -00 
 11      05   -08   -07   -13    20    10 
 12      25    41    90   -00   -01    00 
 13      46    20    38    06    21   -36 
 14      64   -03    12    22    01   -00 
 15      42   -19    00   -13   [?]    44 
 16      14   -32   -28   -00    39   [?] 
 17      00    06   -03    14    08    61 
 18      03   -01   -02   -02   -14    75 
 19     -01    65    03   -15    61   -00 
 20      20   -08   -00   -10    41    30 
 21      02    28    15    01   -08    66 
 22      00    00   -58    05    49    51 
 23     -06    27   -19   -09   -01    84 
 24     -01    07    04   -09    06    71 

Note.—Decimal points have been omitted. 


had a coefficient of congruence of >.75 with one of the LJ-II factors, 
indicating considerable similarity among the analyses. For the most 
part, these coefficients among loadings provided the same conclusions 
as our subjective examination of factor similarity based on 
inspection of high loadings in both analyses. However, that 
inspection indicated no LJ-II factors similar to LJ-I Factor IV, while 
the congruence coefficients indicated that it is similar to LJ-II 
Factor V. Combining our subjective examination with the coefficients 
between the factors for the two analyses led to the conclusion 
that at least five, and perhaps six, of the LJ-I factors were 
represented similarly in the LJ-II analysis. 


TABLE 4 
Coefficients of Congruence between Little Jiffy I and Little Jiffy II Factors 

                              LJ-I Factors 
LJ-II Factors     I    II   III    IV     V    VI   VII 
I                11   -09    70   -32    25   -00    66 
II               32   [?]   -11    13    18   -01    40 
III              18   -32   -07    28   -08    02    76 
IV               20   -13   -14   -04    80    62    19 
V                32   -47   -22   -82    05   -19    18 
VI               86   -48    20   [?]    14    15   [?] 

Note.—Decimal points have been omitted. 


The Kaiser scaling of the LJ-II pattern matrix is not presented. 
By making the mean squared loadings on each factor equal to 
one, the loadings greater than the average have values greater 
than one. This procedure directs the attention of the investigator 
to these “high” loadings. Unfortunately, a disadvantage of this 
procedure is that the size of the loadings, and hence which of them 
appear large, i.e., ≥ 1.00, depends on the average size of the loadings 
for a given factor. As a result, same-sized conventional loadings 
in different columns may have different values when Kaiser-scaled. 
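
The dependence of the scaled values on the column in which a loading appears can be shown with a small sketch. It assumes that the scaling simply divides every loading in a column by the square root of that column's mean squared loading, which is one way of making the mean squared loading equal to one; the function name and the two-factor pattern are hypothetical.

```python
import math


def kaiser_scale(pattern):
    """Rescale each column so that its mean squared loading equals one.

    `pattern` is a list of rows, one row of loadings per variable.
    """
    n_rows = len(pattern)
    columns = list(zip(*pattern))
    root_mean_squares = [
        math.sqrt(sum(value * value for value in column) / n_rows)
        for column in columns
    ]
    return [
        [value / rms for value, rms in zip(row, root_mean_squares)]
        for row in pattern
    ]


# Hypothetical pattern: the conventional loading .40 occurs in both columns.
pattern = [
    [0.40, 0.40],
    [0.80, 0.20],
    [0.60, 0.10],
    [0.20, 0.10],
]
scaled = kaiser_scale(pattern)
# The first row shows the same conventional loading after scaling in a
# column of large loadings and in a column of small loadings.
print([round(value, 2) for value in scaled[0]])
```

In this illustration the loading of .40 falls below one in the column whose loadings are generally large and well above one in the column whose loadings are generally small.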

Factor intercorrelations for the oblique factors of LJ-II are 
presented in Table 5 (the varimax factors of LJ-I are, of course, 
uncorrelated). Factors V and VI correlated highest, .78. 

There was not a clear difficulty factor apparent in the LJ-I 
rotated or unrotated factors, nor in the LJ-II factors. Correlations 
of factor loadings with the task difficulty levels varied from −.31 
to .32 for LJ-II. For the unrotated LJ-I, correlations with 
difficulty ranged from −.31 to .34, while for the rotated LJ-I factors 
the correlations varied from −.48 to .46. The somewhat higher 
correlations of factor loadings with difficulty for the rotated LJ-I 
factors could be interpreted as indicating that LJ-I was more 
susceptible to difficulty factors. Data which present clearer 
difficulty factors should answer this question more adequately. 


TABLE 5 
Little Jiffy II Factor Intercorrelations 

                  Factor 
          I    II   III    IV     V 
II       29 
III     -22   -73 
IV      -14   -46    42 
V        10   -33    61    46 
VI       15   [?]    64    47    78 

Note.—Decimal points have been omitted. 


Summary 

Applying Kaiser’s first and second generation Little Jiffy's to 
the same data indicated considerable similarity in the results 
obtained. LJ-I produced seven factors, while LJ-II produced six. Each 
of the six LJ-II factors was similar to one of the LJ-I factors 
(coefficients of congruence between factors >.75). Interpretations 
based on high loadings for the two sets of factors would be quite 
similar for five of the factors. 

Measures of the sampling adequacy for the overall data indicated 
that the data did sample the domain adequately. Additional 
analyses seemed to verify that conclusion. MSA’s for the individual 
variables were useful indicators of variables which should be 
examined. Although in general the data did not produce difficulty 
factors, there was some indication that LJ-II was less susceptible to 
task difficulty than LJ-I. 

For the data analyzed, the second generation Little Jiffy pro- 
vided useful information, much of it quite similar to that obtained 
from the first generation Little Jiffy. If data merit examination 
by factor analysis, then LJ-II would seem to be a useful adjunct 
to LJ-I. 


SELECTING A SUBMATRIX LIKELY TO CONTAIN 
ONLY “DISTURBED” SUBJECTS¹ 


LOUIS L. McQUITTY, RICHARD G. BANKS, 
JEWEL M. FRARY, AND CHARLES D. AYE 


University of Miami, Coral Gables 


A recent study (McQuitty, Banks and Frary, 1970) of matrices 
of interassociations between individuals shows that the interassocia- 
tions between “disturbed” individuals are less than those for “nor- 
mal” individuals. These differences seem to express themselves most 
in the more extreme relationships, such as either the lowest or the 
highest index for every disturbed versus every normal subject. 

This paper develops and applies several criteria for determining 
which of two submatrices contains more disturbed than normal sub- 
jects. A matrix of an equal number of disturbed and normal subjects 
is divided into all possible submatrices of equal size and the 
criteria are applied to predict which pair yields the best differentiation 
between disturbed and normal subjects. 


The Approach 

Subjects, data, and theory are outlined fully elsewhere (McQuitty, 
Abeles, and Clark, 1970). Only a brief summary is required 
here. 


The Data 

The data are the answers to a test of 130 items by eight 
“disturbed” and eight “normal” subjects. Every item associated an 
emotion with a personal concept as illustrated by: (1) “The word 
mother suggests hope. yes no ?” and (2) “The word father suggests 
hate. yes no ?.” 

The 13 concepts are: (1) control, (2) self, (3) marriage, (4) 
religion, (5) father, (6) achievement, (7) woman, (8) closeness, 
(9) distance, (10) dependency, (11) sex, (12) man, and (13) friend. 
The 10 emotions are: (1) fear, (2) loneliness, (3) love, (4) hate, 
(5) guilt, (6) hope, (7) anxiety, (8) anger, (9) frustration, and (10) 
depression. 


1 This investigation was supported by Public Health Service Research Grant 
No. MH 14070-03 from National Institute of Mental Health. 


Theory 


Psychological disturbance is assumed to express itself in emotional 
components of interpersonal relations, such as in responses to the 
above kinds of items. Disturbed individuals give some relatively 
unique combinations of responses. The area of expression usually 
pertains only to a few items and varies from individual to individ- 
ual. 

It is hypothesized that disturbed individuals are relatively unlike 
both one another and normal individuals. They lack relatively high 
indices of association with either themselves or others. As a group, 
their high indices of interassociations are less than those for normals. 


Agreement Scores 


Two persons have an agreement on an item of a test if and only if 
both of them give the same answer to the item, whether “yes,” “no,” 
or “?.” The agreement score between any two subjects is the number 
of items on which they have an agreement. 
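
The definition can be stated compactly as a sketch in code. The function names and the three-item answer sheets are hypothetical; only the counting rule comes from the definition above.

```python
def agreement_score(answers_a, answers_b):
    """Number of items on which two subjects give the same answer."""
    if len(answers_a) != len(answers_b):
        raise ValueError("both subjects must answer the same items")
    return sum(1 for a, b in zip(answers_a, answers_b) if a == b)


def agreement_matrix(responses):
    """Agreement score of every subject with every other subject.

    `responses` holds one list of answers per subject. Diagonal cells
    are left void (None), as in the matrix described in the text.
    """
    n = len(responses)
    return [
        [
            None if i == j else agreement_score(responses[i], responses[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical answers of three subjects to three items.
responses = [
    ["yes", "no", "?"],
    ["yes", "?", "?"],
    ["no", "no", "yes"],
]
matrix = agreement_matrix(responses)
```

The resulting matrix is symmetric, since the agreement of subject A with subject B equals that of B with A.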

A matrix was prepared (Table 1) to show the agreement score of 
every subject with every other subject; the diagonal cells are void. 

The matrix of Table 1 was divided into all possible pairs of 
submatrices of eight subjects each, with the one requirement that no 
submatrix reproduced any previous submatrix in terms of the eight 
subjects it contained. The entries of the submatrices were the 
agreement scores from the corresponding cells of Table 1. Measures were 
then applied to select the submatrix most likely to contain only 
disturbed subjects. 


Subject Scores 


The agreement scores in each column of each submatrix were 
assigned ranks one through seven with one going to the largest score. 
Four measures were computed for each column (i.e., each subject), 
viz., the means of agreement scores for (1) Ranks 1 and 2; (2) 
Ranks 1, 2, 3, and 4; (3) Ranks 1, 2, 3, 4, 5, and 6; and (4) Ranks 
1, 2, 3, 4, 5, 6, and 7. 


TABLE 1 
Agreement Scores between 16 Subjects over 130 Items of a Test 

       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
 1    --  101   46   83   84   86   81   95   95  101   88   87  100   82   86   71 
 2   101   --   19   79   93   83   86   84  100   94  106   73  115   88   84   79 
 3    46   19   --   58   36   56   49   60   44   45   18   48   30   51   39   16 
 4    83   79   58   --   81   80   78   84   81   75   71   71   85   86   77   56 
 5    84   93   36   81   --   84   86   93  101   86   83   72   93   92   73   80 
 6    86   83   56   80   84   --   87   90   93   87   78   81   94   84   92   66 
 7    81   86   49   78   86   87   --   84   97   76   80   76   89   90   75   64 
 8    95   84   60   84   93   90   84   --  104   96   77   83   93   98   79   70 
 9    95  100   44   81  101   93   97  104   --   98   91   85   99   93   80   76 
10   101   94   45   75   86   87   76   96   98   --   83   81   91   86   78   69 
11    88  106   18   71   83   78   80   77   91   83   --   69  104   80   84   84 
12    87   73   48   71   72   81   76   83   85   81   69   --   83   72   75   61 
13   100  115   30   85   93   94   89   93   99   91  104   83   --   94   90   80 
14    82   88   51   86   92   84   90   98   93   86   80   72   94   --   79   72 
15    86   84   39   77   73   92   75   79   80   78   84   75   90   79   --   69 
16    71   79   16   56   80   66   64   70   76   69   84   61   80   72   69   -- 


Submatrix Scores 


Analogously, there were four scores for each of the two 
submatrices. These were obtained by taking the mean of each measure for 
the eight subjects of every submatrix. 


Difference Score 


The difference score is the absolute value of the difference between 
the mean for one submatrix and the mean for the other submatrix. 


Selecting the Submatrices 


The difference score was applied to select for each measure the 
pairs of submatrices which gave the first, second, third, fourth, and 
fifth largest difference scores from among all possible pairs of 
submatrices. 
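
The sequence of subject scores, submatrix scores, and difference scores can be sketched as follows. This is an illustration under stated assumptions rather than the authors' program: the function names are hypothetical, the argument `top` stands for the number of highest ranks averaged in a measure (2, 4, 6, or 7 in the text), and the four-subject matrix at the end is invented so that the arithmetic can be followed by hand.

```python
from itertools import combinations


def subject_score(agreement, members, subject, top):
    """Mean of the `top` largest agreement scores of `subject` within a submatrix."""
    scores = sorted(
        (agreement[subject][other] for other in members if other != subject),
        reverse=True,
    )
    if top > len(scores):
        raise ValueError("`top` exceeds the number of other subjects")
    return sum(scores[:top]) / top


def submatrix_score(agreement, members, top):
    """Mean of the subject scores over the members of one submatrix."""
    return sum(subject_score(agreement, members, s, top) for s in members) / len(members)


def difference_score(agreement, group_a, group_b, top):
    """Absolute difference between the two submatrix scores."""
    return abs(
        submatrix_score(agreement, group_a, top)
        - submatrix_score(agreement, group_b, top)
    )


def ranked_splits(agreement, top, keep=5):
    """The `keep` pairs of equal-sized submatrices with the largest difference scores."""
    subjects = range(len(agreement))
    half = len(agreement) // 2
    results = []
    for group_a in combinations(subjects, half):
        if 0 not in group_a:
            continue  # keeping subject 0 in group_a counts each pair only once
        group_b = tuple(s for s in subjects if s not in group_a)
        results.append((difference_score(agreement, group_a, group_b, top), group_a, group_b))
    results.sort(reverse=True)
    return results[:keep]


# Hypothetical agreement matrix for four subjects (diagonal void).
toy = [
    [None, 10, 5, 5],
    [10, None, 5, 5],
    [5, 5, None, 2],
    [5, 5, 2, None],
]
best = ranked_splits(toy, top=1, keep=1)[0]
```

For 16 subjects the same loop examines every distinct division into two submatrices of eight, so an exhaustive search of this kind remains small enough to carry out directly.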


Significance of Results 
The number of disturbed and normal subjects in each submatrix 
was recorded. 


Let 

(1) x = for each pair of submatrices, the number of “disturbed” 
subjects in the one (or two) matrices with the largest number 
of them. 

(2) r = the number of subjects in each of the two submatrices 
(8 in this study). 

(3) n = the number of subjects (disturbed and normal) in the 
matrix (16 in this study). 

(4) P = probability of x or more disturbed subjects in a 
submatrix of size r by chance alone. 

(5) C = number of combinations of n subjects taken r at a 
time = [n!/r!(n − r)!] (in this study, [16!/8!·8!] = 12,870). 

(6) c_x = the number of combinations of a submatrix of x or 
more disturbed subjects. 

(7) c [subscript illegible] = the number of combinations of r subjects 
taken x disturbed subjects at a time when only the normal subject(s) 
are interchanged between the two submatrices = [r!/x!(r − x)!] 

(8) c [subscript illegible] = number of combinations of r subjects 
taken x disturbed subjects at a time when only the disturbed 
subject(s) are interchanged between the two submatrices 
= [r!/x!(r − x)!] 

(9) c [subscript illegible] = the number of combinations of r subjects 
taken x disturbed subjects at a time when both disturbed and normal 
subjects are interchanged between submatrices = [r!/x!(r − x)!]² 

(10) c_x = the number of combinations of r subjects taken x 
or more disturbed subjects at a time = [r!/x!(r − x)!]² + 
[r!/(x + 1)!(r − x − 1)!]² + [r!/(x + 2)!(r − x − 2)!]² + ... + 
[r!/r!(r − r)!]² 

(11) P = c_x/C 


The same answer is obtained by an application of the hyper- 
geometric probability function, where: 


N = number of subjects (disturbed and normal) 
n = number of subjects drawn without replacement 
k = number of normals (failures) 
x = number of normals in the sample 
F(x) = the probability of x or fewer failures 

$$F(x) = \sum_{r=0}^{x} \frac{\binom{k}{r}\binom{N-k}{n-r}}{\binom{N}{n}}$$ 
(Beyer, W. H., 1966) 
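
The chance probability can be checked numerically with a short sketch. It assumes the hypergeometric model described above, with 8 disturbed and 8 normal subjects and a submatrix of 8; the function name and its default arguments are hypothetical conveniences.

```python
from math import comb


def chance_probability(x, r=8, n=16, disturbed=8):
    """Probability of x or more disturbed subjects in a random submatrix of size r."""
    total = comb(n, r)  # all ways of drawing r of the n subjects
    favourable = sum(
        comb(disturbed, i) * comb(n - disturbed, r - i)
        for i in range(x, min(r, disturbed) + 1)
    )
    return favourable / total


# Seven or more of the eight disturbed subjects falling in one submatrix.
p_seven_or_more = chance_probability(7)
```

With seven or more disturbed subjects the count of favourable submatrices is 8 × 8 + 1 = 65 out of 12,870, a probability of about .005.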


Results 


Table 2 shows the first, second, third, fourth, and fifth largest 
difference scores for each measure and the number of disturbed 
subjects in the submatrix having the smaller score. For example, the 
largest difference score for Measure No. 1 is 25.4; the matrix with 
the smaller score contains seven of the eight disturbed subjects. In 
only five out of 1,000 trials would this result be determined by chance 
alone. 

The largest difference score for Measure No. 2 yields the same two 
submatrices as just reported above. No other measures yield this 
degree of differentiation between the two categories of subjects. The 
degree of differentiation decreases from the above level as either the 
size of the difference score decreases or the number of variables used 
in the measures increases. 


Interpretation 

Disturbed individuals are less interrelated with one another in 
their relatively high indices of association than are “normal” indi- 
viduals. This fact can be used to divide a matrix of interassociations 
between people into two submatrices of such a nature that one con- 
tains predominantly disturbed individuals and the other one con- 
tains predominantly normal individuals. 

The above conclusion is based on results from only 16 subjects 
and with test items selected to be uniquely appropriate to assess the 
degree of psychological disturbance reflected by the particular sub- 
jects of the study. However, the disturbed and normal subjects differ 
relatively little from one another in the sense that both categories 
were functional college students. The members of the former cate- 
gory had been judged by their clinical counselors at the Counseling 
Center at Michigan State University to be relatively disturbed and 
members of the latter category had been judged by their clinical 
counselors to be relatively normal (McQuitty, Abeles, and Clark, 
1970). 

Results from the first study referenced herein helped to generate 
the methods of this study. However, the same basic hypothesis was 
investigated in both studies. 

The current study needs to be repeated with other samples of 
subjects and test items. 


Summary 


A matrix of interassociations between an equal number of normal 
and disturbed subjects in terms of test items especially prepared to 
assess their psychological disturbance was divided into all possible 
pairs of submatrices of equal size. A measure was developed and 
shown to be effective in selecting two submatrices containing pre- 
dominantly disturbed subjects in one submatrix and normal sub- 
jects in the other. 
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WEIGHTED CHI SQUARE: AN EXTENSION OF THE 
KAPPA METHOD! 


JACOB COHEN 
New York University 


In the usual analysis of contingency tables by means of χ², it is 
frequently the case that the investigator has a priori expectations 
that a specified few of the cells will have larger than chance 
proportions (or frequencies), while he has no particular hypotheses 
with regard to the remaining cells. Or, if his theory is stronger, he 
may be able to articulate his hypotheses about the outcome for each 
cell with greater refinement, distinguishing chance and larger and 
smaller than chance expectations in varying degrees. In either case, 
neither the measure of association he computes (if any) nor the χ² 
test he performs takes into account in any way his a priori 
hypotheses about the contingency table. They merely index the degree and 
significance of the collective departure from chance, in all directions 
and degrees indiscriminately, of the values he observes in the cells 
when he has collected and organized his data. 

This article presents a very general method for the study of 
m-way tables of proportions or frequencies (where m is one or more) 
in which the investigator's a priori hypotheses about the cells are 
expressed numerically and used as weights. These weights are then 
used in κ_w, an index of hypothesized association, and also in a test 
of its significance, weighted χ² (χ²_w), which thus utilizes as relevant 
information the investigator's hypotheses (anticipations, hunches, 
theory). 


1 The material on which this article is based was presented at the Oxford 
University meeting of the European Society for Multivariate Experimental 
Psychology (July 1969) and the American Society for Multivariate 
Experimental Psychology (November 1969). The author is grateful to his 
colleagues for their stimulating discussion of the material. 

The system to be described is a direct outgrowth of work which 
was initiated to provide a coefficient of agreement for nominal 
scales, e.g., for two clinicians making one of four diagnoses for N 
patients, or two coders making one of six categorical placements of 
survey interview responses of N respondents (Cohen, 1960). The 
measure which was proposed, κ, is simply the proportion of 
agreement for the N cases placed in the k categories by the two judges, 
corrected for chance agreement (Cohen, 1960, Formulas 1 and 2). 
Standard error formulae for significance testing and setting 
confidence limits were also presented (see below), and since for large 
samples κ is approximately normally distributed, statistical tests 
and estimates take the familiar classical form. 

An obvious characteristic of κ is its implicit treatment of all 
disagreements as equal. Thus, in a study of the interjudge reliability 
of diagnosis, the pair neurosis-schizophrenia assigned by two 
clinicians to the same patient is counted exactly the same way as the 
pair neurosis-personality disorder. If it is the sense of the 
investigator that such equality of treatment of all disagreements lacks 
fidelity to his purpose, κ is deficient. Many circumstances exist in 
which some disagreements are of greater gravity than others, or, 
equivalently, where the investigator wishes to distinguish degrees 
of partial agreement for which he wishes to give part credit. 

To meet this need, weighted κ (κ_w) was invented (Cohen, 1968). 
For each cell of the usual k × k table of proportions (or 
frequencies) of joint judgments, a weight is assigned, a priori, to carry the 
information of degree of gravity of disagreement, or, alternatively, 
the amount of agreement credit. κ_w is interpretable as the 
proportion of weighted agreement corrected for chance. Its formula can be 
written either for disagreement (v_ij) or agreement (w_ij) weights, 
which are taken to be non-negative values in a ratio scale. κ_w, 
like κ, is large-sample normally distributed and comes with standard 
error formulae for estimation and significance testing (see below). 
κ is readily seen to be that special case of κ_w where only two weights 
are used, one for the k cells in the agreement diagonal and another 
for the k² − k off-diagonal disagreement cells. 


Further Generalization of κ_w 


Up to this point, the thinking about κ_w was in a rigid framework 
in which the same categorical set of k alternatives appeared on both 
axes of a two-way table, e.g., two judges making independent 
diagnoses using the same k possibilities. Even the extension to validity 
(Cohen, 1968) stayed in this framework, e.g., computer diagnosis 
tested against expert panel consensus diagnosis, but each mapping 
cases into the same category set. It was soon realized, however, that 
nothing in the mathematical-statistical development of the κ_w 
system required the constraints imposed by the narrow reliability 
application from which it began. One need not think in terms of 
“agreement” and “disagreement”, nor is there any necessary 
restriction to bivariate (2-way) tables, nor do the weights need to be 
constrained to non-negative values in a ratio scale. All that is required 
is that each cell of a table have an a priori weight, a 
null-hypothetical chance proportion and an observed proportion of N 
observations (or the respective frequencies) and κ_w can be 
meaningfully computed and statistically manipulated. In this larger 
framework, κ_w and, more particularly, its significance test, weighted 
χ² (χ²_w), are put to work to provide a very general system for 
studying differentiated composite hypotheses about m-way tables of 
proportions or frequencies. 


Statistical Development 


Assign, a priori, a set of hypothesis weights u_ij′, one to each cell, 
which collectively carry the investigator's hypotheses or 
expectations. The u_ij′ are not constrained to non-negative values, nor do 
they represent “agreement” or “disagreement” values as in 
reliability applications. They are any set of real numbers (not all equal) 
such that large values are assigned to cells where it is expected that 
p_ij > p_i. p_.j and small values where it is expected that 
p_ij < p_i. p_.j. These weights can be more or less articulated as will 
be seen below. 

Since κ_w and χ²_w are invariant over any linear transformation of 
the weights, for convenience and with no loss of generality we 
rescale the u_ij′ into a 0-1 interval by 

$$u_{ij} = \frac{u_{ij}' - u_{\min}'}{u_{\max}' - u_{\min}'} \tag{1}$$ 

where u_max′ is the largest weight used and u_min′ the smallest. 

Now find 

$$p_o = \sum_{i=1}^{r} \sum_{j=1}^{k} u_{ij}\, p_{ij} \tag{2}$$ 

and 

$$p_c = \sum_{i=1}^{r} \sum_{j=1}^{k} u_{ij}\, p_{i.}\, p_{.j} \tag{3}$$ 

where r is the number of rows, k is the number of columns, p_ij is 
the proportion of cases observed in the i, j cell, and p_i. p_.j is the 
product of the proportions of row i and column j and therefore the 
chance-expected proportion for the i, j cell. p_o is thus the proportion 
of hypothesis-weighted observed association for the table and p_c the 
proportion of hypothesis-weighted chance association. 

κ_w can now be defined as before (Cohen, 1968, Formula 1): 

$$\kappa_w = \frac{p_o - p_c}{1 - p_c} \tag{4}$$ 
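
The computation of κ_w from Formulas 1 through 4 can be sketched in a few lines. The function names and the 2 × 2 table of proportions are hypothetical, and the weights are rescaled internally with Formula 1 so that any set of real-valued weights may be supplied.

```python
def rescale(weights):
    """Formula 1: map arbitrary real weights onto the 0-1 interval."""
    flat = [w for row in weights for w in row]
    low, high = min(flat), max(flat)
    if low == high:
        raise ValueError("the weights must not all be equal")
    return [[(w - low) / (high - low) for w in row] for row in weights]


def weighted_kappa(table, weights):
    """Formula 4 for an r x k table of observed proportions."""
    u = rescale(weights)
    row_p = [sum(row) for row in table]        # row proportions
    col_p = [sum(col) for col in zip(*table)]  # column proportions
    cells = [(i, j) for i in range(len(table)) for j in range(len(table[0]))]
    p_o = sum(u[i][j] * table[i][j] for i, j in cells)          # Formula 2
    p_c = sum(u[i][j] * row_p[i] * col_p[j] for i, j in cells)  # Formula 3
    return (p_o - p_c) / (1 - p_c)


# Hypothetical 2 x 2 table of proportions with 0/1 weights on the diagonal.
table = [[0.4, 0.1], [0.1, 0.4]]
unit_weights = [[1, 0], [0, 1]]
kappa_w = weighted_kappa(table, unit_weights)
```

With only two weight values on a square table, as here, the result is the unweighted κ, and replacing the weights by any increasing linear transformation of them (for example 5 and 3 in place of 1 and 0) leaves the value unchanged.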

Both the original approximate standard error formulas for 
κ and κ_w given in Cohen (1960, 1968) are incorrect, but in a 
conservative direction, i.e., too large. A different approximate formula 
for the standard error of κ_w given by Everitt (1968) also proved to 
be incorrect. The correct formulas are given by Fleiss, Cohen, and 
Everitt (1969). The formula for the large-sample approximation to 
the standard error of κ_w under the null hypothesis given there 
(Formula 9) is 

$$\sigma_{\kappa_{w0}} = \frac{1}{(1 - p_c)\sqrt{N}} \left\{ \sum_{i=1}^{r} \sum_{j=1}^{k} p_{i.}\, p_{.j} \left[ u_{ij} - (\bar{u}_{i.} + \bar{u}_{.j}) \right]^2 - p_c^2 \right\}^{1/2} \tag{5}$$ 

where 

$$\bar{u}_{i.} = \sum_{j=1}^{k} u_{ij}\, p_{.j},$$ 

a column-proportion weighted average of the i row weights, and 

$$\bar{u}_{.j} = \sum_{i=1}^{r} u_{ij}\, p_{i.},$$ 

a row-proportion weighted average of the j column weights. 

Since κ_w is approximately normally distributed for large samples, 
the testing of H₀: κ_w = 0 can take the form of a conventional 
normal curve test, i.e., 

$$z = \frac{\kappa_w}{\sigma_{\kappa_{w0}}} \tag{6}$$ 

However, we can capitalize on the fact that the standard normal 
deviate squared is distributed as χ² with one df and cast the result 
in the χ² form traditionally used in the analysis of frequencies and 
proportions: 

$$\chi_w^2 = z^2 = \left( \frac{\kappa_w}{\sigma_{\kappa_{w0}}} \right)^2 \tag{7}$$ 

where χ²_w is distributed as χ² with one degree of freedom, no matter 
what the dimensionality of the table. 

When Formulas 4 and 5 are substituted in Formula 7 and 
simplified, the result is 

$$\chi_w^2 = \frac{N (p_o - p_c)^2}{\sum_{i=1}^{r} \sum_{j=1}^{k} p_{i.}\, p_{.j} \left[ u_{ij} - (\bar{u}_{i.} + \bar{u}_{.j}) \right]^2 - p_c^2} \tag{8}$$ 


The χ²_w of Formula 8 is offered for use in all contexts where χ² is 
now used with frequency and proportion data, and where the 
investigator wishes to improve the power of the statistical test by 
including his hypotheses (hunches, expectations) about the outcome. 
κ_w is an incidental benefit in this scheme in that it provides a 
measure of hypothesized association, a “rho” measure (Cohen, 1965, 
pp. 101-106). But χ²_w, although a test of the significance of κ_w, can 
be used independently as a test of a specified alternative hypothesis 
to the usual null hypothesis of no association. As noted, this 
alternate hypothesis is expressed in the form of a set of weights assigned 
to the cells of the table. Obviously, these weights must be assigned 
before the cell proportions or frequencies are known.² 
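
Formula 8 lends itself to direct computation, as the following sketch suggests. The function name and the 2 × 2 example are hypothetical; the weights are rescaled with Formula 1 inside the function, and the table is assumed to hold observed proportions that sum to one.

```python
def weighted_chi_square(table, weights, n_cases):
    """Formula 8 for an r x k table of observed proportions."""
    flat = [w for row in weights for w in row]
    low, high = min(flat), max(flat)
    if low == high:
        raise ValueError("the weights must not all be equal")
    u = [[(w - low) / (high - low) for w in row] for row in weights]  # Formula 1

    n_rows, n_cols = len(table), len(table[0])
    row_p = [sum(row) for row in table]        # row proportions
    col_p = [sum(col) for col in zip(*table)]  # column proportions
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]

    p_o = sum(u[i][j] * table[i][j] for i, j in cells)          # Formula 2
    p_c = sum(u[i][j] * row_p[i] * col_p[j] for i, j in cells)  # Formula 3

    # Weighted averages of the weights for each row and each column.
    row_avg = [sum(u[i][j] * col_p[j] for j in range(n_cols)) for i in range(n_rows)]
    col_avg = [sum(u[i][j] * row_p[i] for i in range(n_rows)) for j in range(n_cols)]

    spread = sum(
        row_p[i] * col_p[j] * (u[i][j] - (row_avg[i] + col_avg[j])) ** 2
        for i, j in cells
    ) - p_c ** 2
    return n_cases * (p_o - p_c) ** 2 / spread


# Hypothetical 2 x 2 table of proportions for N = 100 cases.
table = [[0.4, 0.1], [0.1, 0.4]]
chi_w = weighted_chi_square(table, [[1, 0], [0, 1]], 100)
```

For this 2 × 2 table with 0/1 weights on the diagonal the result coincides with the ordinary one-df χ² for the same table (36 in this illustration), which offers a convenient check on an implementation.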


The Hypothesis Weights 


The fact that the set of u_ij weights can be any set of real 
numbers, positive, negative or zero, whole or fractional, confers upon 
their application great flexibility. This property makes it possible to 
express hypotheses about tables with a high degree of fidelity to the 
strength of the theory, the investigator's confidence, and the 
purpose of the analysis. To illustrate this, several strategies of assigning 
u_ij will be described by means of an r × k (i.e., two-way) 
contingency table. Generalization to m-way tables will be discussed 
later. 

2 Note that it is the observed cell values which must not be known. The 
marginal values (univariate frequency distributions) of contingency tables, 
which provide the chance values of the cells may be known in advance of 
setting the hypothesis weights. 


Illustrative Example 


Assume that an investigator plans a study of the relationship between
psychiatric diagnosis ($r = 3$) and pattern of employment
history ($k = 4$). The diagnostic alternatives are schizophrenia (S),
neurosis (N) and personality disorder (P), and the employment
alternatives are steady job (J), many jobs (M), little work (L), and
no jobs (O). Data are to be available for N = 100 cases.

Table 1 gives the 3 × 4 layout. For compactness of presentation,
each cell contains its observed proportion ($p_{ij}$) and, in parentheses,
its chance proportion ($p_{i.}p_{.j}$) of the total N, but this information is
not (and the $p_{ij}$ certainly should not be) available to the investigator
prior to setting the hypothesis weights (see Footnote 2).


TABLE 1
A 3 × 4 Contingency Table Relating Diagnosis to Pattern of Employment
(N = 100)

[Cell entries are not recoverable here. Row marginal proportions $p_{i.}$
for Diagnosis: .30, .40, .30; total 1.00.]

In deciding upon the $u'_{ij}$ values for the cells, the investigator is
guided by the strength of his hypothesis, i.e., its degree of refinement
or articulation and the strength of his convictions. Since the
$u'_{ij}$ can be any real numbers, he may have as much or as little
refinement as he finds suitable. Generally, he assigns larger $u'_{ij}$ values
to cells where he expects $p_{ij} > p_{i.}p_{.j}$ and smaller $u'_{ij}$ values where
he expects $p_{ij} < p_{i.}p_{.j}$. The greater the number of different values
employed, the more articulated the hypothesis system.

Consider first the simplest possible system of weights, one which
uses only two values, 0 and 1 (Table 2, Column 1). These are already
u values, so they require no rescaling with Formula 1. Prior to data
gathering, the investigator considers the 12 (empty) cells of Table
1. His hypotheses lead him to expect more than a chance-expected
proportion of the cases in cells c, d, e, f, g, j, and k. Column 1 of
Table 2 assigns 1s to these cells and 0s to the others. When the data
have been gathered, the $p_{ij}$ values, and from them, the $p_{i.}p_{.j}$ values
are found; they are given in Table 1. Substitution in Formula 8
gives

$$
\chi_w^2 = \frac{100(.73 - .63)^2}{.50580 - .63^2} = 9.183.
$$


A format to facilitate the computation of the double summation
term in the denominator is given in Fleiss, Cohen, and Everitt
(1969), although it is hardly needed when 0-1 weights are used.

Substitution in Formula 4 gives

$$
\kappa_w = \frac{.73 - .63}{1 - .63} = +.270.
$$

$\chi_w^2$, which is assessed by reference to the $\chi^2$ distribution for 1 df,
is significant at P < .005.³ The investigator would conclude that
the association, as hypothesized, is significant. The $\kappa_w$ of +.270
provides a measure of the degree of hypothesized association.
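
The arithmetic of this example can be retraced in a few lines. The sketch below recomputes $\kappa_w$ from the reported weighted proportions (.73 and .63) and converts the reported $\chi_w^2$ of 9.183 into a normal deviate with its one-tailed and two-tailed probabilities. The variable names are illustrative, and the normal tail area is obtained from the complementary error function rather than from a printed table.

```python
import math

p_o, p_c = 0.73, 0.63          # weighted observed and chance proportions, Set 1
chi_w_sq = 9.183               # value reported for Set 1

kappa_w = (p_o - p_c) / (1 - p_c)            # Formula 4

# With 1 df, the square root of chi-square is a standard normal deviate.
z = math.sqrt(chi_w_sq)
p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal area
p_two_tailed = 2 * p_one_tailed

print(round(kappa_w, 3), round(z, 2),
      round(p_one_tailed, 4), round(p_two_tailed, 4))
```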

The concept “hypothesized association” requires some scrutiny.
Since there are as many u weights as there are cells, the hypothesis
in question is a composite of the separate cell hypotheses, and when
one concludes significant association “as hypothesized,” it does not
necessarily mean that all the cell hypotheses were valid, but only
that their weighted average is valid, i.e., that the alternate composite
hypothesis is generally valid as against the null hypothesis.
For example, note in Tables 1 and 2 that the investigator’s hypotheses
(expectations) of large $p_{ij}$ relative to $p_{i.}p_{.j}$ were not met
with regard to cells f, g, and k; nevertheless, $\chi_w^2$ was significant
because the set of u values was more true than false, hence generally
correct, to a sufficient degree ($\kappa_w = +.270$) and with large
enough N to be significant.

³ If “exact” P values are desired, find $\sqrt{\chi_w^2} = z$, giving z the sign of
$\kappa_w$, and refer to the normal distribution. For the example,
$z = \sqrt{9.183} = +3.03$, $P_1 = .0012$, $P_2 = .0024$ (one-tailed and
two-tailed, respectively).

In using 0-1 weights, we actually revert mathematically to unweighted
$\kappa$ (Cohen, 1960), but instead of agreement between judges,
what is here being indexed is the chance-corrected proportion of
cases occurring in the cells predicted. If all the cases fall in the
u = 1 cells and none in the u = 0 cells, $\kappa$, i.e., $\kappa_w$ with weights of
0 and 1, = +1.00.

The 0-1 weighting strategy is obviously the simplest possible, and
is appropriate in circumstances where the theory is relatively weak
and unarticulated. Consider now an alternative possibility that our
investigator is prepared to express his theory in somewhat greater
detail a priori. A second set of weights, $u_2'$ (Table 2), has been
assigned so that +1 is placed in cells where it is hypothesized that
$p_{ij} - p_{i.}p_{.j}$ will be positive, -1 where it is hypothesized that it
will be negative, and 0 where either there is no expectation or a
hypothesis of no difference. Substituting in Formula 1 to obtain the


TABLE 2
Alternative Sets of Hypothesis Weights ($u'$, $u$)

        Set 1   Set 2        Set 3         Set 4          Set 5
Cell    u1      u2'   u2     u3'    u3     u4'    u4      u5'     u5
a       0       -1    0      -.5    .25    -.6    .111    -.25    .167
b       0       -1    0      -1.0   0      -.8    0       -.50    0
c       1       +1    1.0    +.5    .75    +.5    .722    +.33    .556
d       1       +1    1.0    +1.0   1.00   +1.0   1.000   +1.00   1.000
e       1       +1    1.0    +.5    .75    +.6    ...     +.25    .500
f       1       +1    1.0    +.5    .75    +.5    ...     -.25    .167
g       1       +1    1.0    +.5    .75    ...    ...     0       .333
h       0       -1    0      -1.0   0      ...    ...     -.50    0
i       0       0     .5     0      .50    -.4    ...     -.08    .278
j       1       +1    1.0    +1.0   1.00   +1.0   ...     +.83    .889
k       1       +1    1.0    +.5    .75    +.9    ...     -.33    .111
l       0       0     .5     0      .50    ...    ...     -.33    .111

Set     $\kappa_w$    $\chi_w^2$    P
1       +.270    9.183    < .005
2       +.305    9.583    < .002
3       +.194    11.046   < .001
4       +.191    9.847    < .002
5       +.143    14.332   < .0002

$\chi^2$ = 14.332 with 6 df, P < .05.
u weights (given as $u_2$ in Table 2), and then using the latter in
Formulas 8 and 4, we obtain

$$
\chi_w^2 = \frac{100(.795 - .705)^2}{.58155 - .705^2} = 9.583 \quad (P < .002),
$$

and

$$
\kappa_w = \frac{.795 - .705}{1 - .705} = +.305.
$$
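
A short check of the Set 2 arithmetic may be useful here. The sketch below substitutes the reported quantities (.795, .705, and the double summation term .58155) into Formulas 8 and 4; the variable names are illustrative only.

```python
n = 100
p_o, p_c = 0.795, 0.705     # weighted observed and chance proportions, Set 2
double_sum = 0.58155        # double summation term of Formula 8, as reported

chi_w_sq = n * (p_o - p_c) ** 2 / (double_sum - p_c ** 2)   # Formula 8
kappa_w = (p_o - p_c) / (1 - p_c)                            # Formula 4
print(round(chi_w_sq, 3), round(kappa_w, 3))
```

Both results agree with the Set 2 row of the summary at the foot of Table 2.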


This set (2) of more articulated hypothesis weights results, in
this instance, in slightly larger $\chi_w^2$ and $\kappa_w$ than the 0-1 weights.
Greater articulation need not result in such increases, nor is it
necessarily the case that, with a change of weights, $\chi_w^2$ and $\kappa_w$ will
change in the same direction, as will be seen in the next example.

An even further degree of articulation in the set of weights is
illustrated in the $u_3'$ column of Table 2. These weights not only
distinguish among the gross expectations of observed proportions
greater than, less than or equal to chance-expected proportions, but
also incorporate to some extent differences in degree. For these
weights (transformed by Formula 1 to the $u_3$ weights given in
Table 2), $\chi_w^2 = 11.046$ (P < .001) and $\kappa_w = +.194$. Note that
there is an increase in $\chi_w^2$ together with a substantial decrease in
$\kappa_w$ relative to the previous u sets. Although $\chi_w^2$ is a significance test
of $\kappa_w$, unlike the product-moment r or other familiar measures of
association, $\kappa_w$'s statistical significance does not necessarily increase
as it increases in magnitude, because of the operation of the
hypothesis weights.
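
The rescaling step referred to as Formula 1 can be sketched as follows. Formula 1 itself is not reproduced in this part of the paper, so the linear min-max mapping used below is an assumption, chosen because it reproduces the relation between the $u_3'$ and $u_3$ columns of Table 2; the function name `rescale_weights` is hypothetical.

```python
def rescale_weights(raw):
    """Map raw weights linearly onto the 0-1 interval (assumed form of Formula 1)."""
    lo, hi = min(raw), max(raw)
    return [(w - lo) / (hi - lo) for w in raw]


# Raw Set 3 weights for cells a through l, as read from Table 2.
u3_raw = [-0.5, -1.0, 0.5, 1.0, 0.5, 0.5, 0.5, -1.0, 0.0, 1.0, 0.5, 0.0]
u3 = rescale_weights(u3_raw)
print(u3)
```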

We take the idea of fine articulation of the u values to its logical
conclusion in the $u_4'$ column of Table 2, where we suppose that the
investigator has mapped his expectations into the continuous interval
from -1 to +1 quite freely. Note that throughout, we have
not corrected the “error” in anticipation of observed greater than
chance-expected proportions for cells f, g, and k; we have merely
been progressively refining the expression of the (partially wrong)
composite hypothesis. For these highly articulated weights, $\chi_w^2$ =
9.847 (P < .002) and $\kappa_w = +.191$. The increased articulation here
results in a lower value for $\chi_w^2$ than for the lesser articulation of
Set 3, but a slight increase over that of Sets 1 and 2. ($\kappa_w$ here also
drops, but only slightly.) Therefore, greater articulation, as such,
is of no necessary advantage.

In summary, in any study whose results are cast in the form of a
table of proportions or frequencies, the investigator can express his a
priori hypotheses by a set of weights which can reflect with high
fidelity the strength of his theory and/or his convictions. Whatever
the size and dimensionality of the table, the $\chi_w^2$ significance test is a
one df $\chi^2$ test which reflects the fit of the data to the hypothesis,
and he can index the degree of hypothesized association by means
of $\kappa_w$. Both these aspects of the analysis, the degree of hypothesized
association as well as its statistical significance, are necessary to an
understanding of the substantive issues (Cohen, 1965, pp. 101-106).


$\chi_w^2$ and $\chi^2$


In the illustrative example, it was seen that the “same” hypothesis,
expressed by four sets of weights differing in degree of articulation,
resulted in varying values of $\chi_w^2$. A useful question to pose is
“What set of weights maximizes $\chi_w^2$?” The answer to this question
provides insight into the statistical strategy of $\chi_w^2$ and also gives the
investigator a specific goal to shoot for in setting his hypothesis
weights.

The maximum value of $\chi_w^2$ for any 2-way table is attained for the
weights

$$
u'_{5ij} = \frac{p_{ij} - p_{i.}p_{.j}}{p_{i.}p_{.j}} \qquad (9)
$$

or, for an m-way table (m > 2), generally,

$$
u_5' = \frac{\text{Cell observed proportion}}{\text{Cell chance-expected proportion}} - 1, \qquad (10)
$$

where the denominator is the product of the m marginal probabilities.
Column $u_5'$ in Table 2 gives these weights for the example, and
Formula 1 maps them into the 0-1 interval ($u_5$ in Table 2).
Please note that these weights are entered in Table 2 in a very
different spirit from that of those preceding. These are not weights
which an investigator is even remotely likely to have stated a
priori, but merely the ones which, post hoc, maximize $\chi_w^2$. Their
use with the data results in $\chi_w^2 = 14.332$ and $\kappa_w = +.143$.
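
The behavior of the maximizing weights can be checked numerically. The sketch below builds the Formula 9 weights for a small table, rescales them to the 0-1 interval, and compares the resulting $\chi_w^2$ with the conventional contingency chi-square computed from the same proportions. The 2 x 3 table is hypothetical, and the min-max rescaling and the definitions of the weighted row and column means are assumptions of the sketch.

```python
def marginals(p):
    rows = [sum(r) for r in p]
    cols = [sum(r[j] for r in p) for j in range(len(p[0]))]
    return rows, cols


def chi_w_sq(p, u, n):
    """Formula 8 for a two-way table of proportions p and 0-1 weights u."""
    pr, pc = marginals(p)
    cells = [(i, j) for i in range(len(pr)) for j in range(len(pc))]
    p_o = sum(u[i][j] * p[i][j] for i, j in cells)
    p_c = sum(u[i][j] * pr[i] * pc[j] for i, j in cells)
    u_row = [sum(pc[j] * u[i][j] for j in range(len(pc))) for i in range(len(pr))]
    u_col = [sum(pr[i] * u[i][j] for i in range(len(pr))) for j in range(len(pc))]
    dsum = sum(pr[i] * pc[j] * (u[i][j] - u_row[i] - u_col[j]) ** 2
               for i, j in cells)
    return n * (p_o - p_c) ** 2 / (dsum - p_c ** 2)


def pearson_chi_sq(p, n):
    """Conventional contingency chi-square from cell and marginal proportions."""
    pr, pc = marginals(p)
    return n * sum((p[i][j] - pr[i] * pc[j]) ** 2 / (pr[i] * pc[j])
                   for i in range(len(pr)) for j in range(len(pc)))


# Hypothetical 2 x 3 table of observed proportions.
p = [[0.20, 0.10, 0.10],
     [0.05, 0.25, 0.30]]
n = 200
pr, pc = marginals(p)

# Formula 9: raw maximizing weights, then rescaled to 0-1 (assumed Formula 1).
raw = [[p[i][j] / (pr[i] * pc[j]) - 1 for j in range(3)] for i in range(2)]
lo = min(min(r) for r in raw)
hi = max(max(r) for r in raw)
u = [[(w - lo) / (hi - lo) for w in r] for r in raw]

weighted = chi_w_sq(p, u, n)
conventional = pearson_chi_sq(p, n)
print(weighted, conventional)
```

In this sketch the two printed values coincide up to floating-point error.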

The $u'_{ij}$ of Formula 9 have a look reminiscent of $\chi^2$ tests on
frequencies or proportions. If Formula 9 is substituted in Formula 1
(to u) and those then used in Formula 8 and simplified, we obtain

$$
\max \chi_w^2 = N \sum_{i=1}^{r}\sum_{j=1}^{k} \frac{(p_{ij} - p_{i.}p_{.j})^2}{p_{i.}p_{.j}}. \qquad (11)
$$

Formula 11 is the standard formula for a $\chi^2$ contingency test
expressed in terms of cell and marginal proportions. Thus, the maximum
possible value which $\chi_w^2$ can attain in any m-way table is
that table’s conventional $\chi^2$ value. But if this is true, wherein lies
the advantage of the $\chi_w^2$ procedure? The answer is simply that $\chi^2$ is
usually based on several or many df, e.g., (r - 1)(k - 1) =
(3 - 1)(4 - 1) = 6 df in the two-way table of the example, while
$\chi_w^2$ is always based on one df. The criterion value of $\chi_w^2$ needed
for significance at one df is smaller than that of $\chi^2$ with its multiple
df. In the illustrative example, for instance, $\chi_w^2 = 14.332$ at one df
is at P < .0002, while $\chi^2 = 14.332$ at six df is at P < .05.
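
The contrast between one and six df can be made concrete with a small calculation. The sketch below evaluates the upper-tail probability of the chi-square distribution for the value 14.332 at 1 df and at 6 df, using closed forms that hold for 1 df and for even df only; the function name `chi_sq_upper_tail` is hypothetical.

```python
import math


def chi_sq_upper_tail(x, df):
    """Upper-tail probability of chi-square for df = 1 or any even df."""
    if df == 1:
        # P(chi-square > x) equals the two-tailed normal area beyond sqrt(x).
        return math.erfc(math.sqrt(x / 2))
    if df % 2 == 0:
        half = x / 2
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("only df = 1 or even df are handled in this sketch")


x = 14.332
p_one_df = chi_sq_upper_tail(x, 1)
p_six_df = chi_sq_upper_tail(x, 6)
print(p_one_df, p_six_df)
```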

Again we note that the maximizing weights are quite hypothetical;
even with a strong theory, sampling error would insure the practical
impossibility of setting these (Set 5) weights in advance. But
one can fall considerably short of the maximizing weights and still
reap a considerable gain in the $\chi_w^2$ test with one df. In Table 2, note
that even the binary weights of Set 1, which are partially incorrect,
confer a considerable advantage in significance over the conventional
$\chi^2$ with six df; all the weights illustrated are advantageous.
All that is required is a set of hypothesis weights which give, over
the cells, a moderate degree of correlation with the maximizing
weights.

A further consequence of the relationship between $\chi_w^2$ and $\chi^2$ is
that no advantage to significance occurs in m-way tables where the
$\chi^2$ test is based on one df, e.g., 2 × 2 or 2 × 2 × 2 tables. No a
priori weights can yield a $\chi_w^2$ greater than $\chi^2$, and both are referred
to the one df $\chi^2$ distribution.

Finally, as one can see from the summary at the foot of Table 2,
the $\kappa_w$ value found for the $\chi_w^2$-maximizing weights of Set 5 is +.143,
the smallest yielded by any of the sets. Thus, unlike product-moment
r, the correlation ratio $\eta$, the contingency coefficient C and
other nonparametric measures of association, where (for constant
N) increases in degree of association are invariably accompanied
by increases in significance, and conversely, weights producing
large $\chi_w^2$ need not produce large $\kappa_w$, and conversely.⁴ (Weighted
analogues of these other measures would have the same property.)
This disparity is inevitable, but it makes of $\kappa_w$ a less than ideal
measure. It is nevertheless a serviceable way of indexing the degree
of hypothesized association.


Directionality

A positive value of $\kappa_w$ indicates that the investigator's composite
hypothesis, as expressed in the weights, tends to hold for the sample,
a negative value that the opposite of the hypothesis tends to be the
case. As with the product-moment r, or a difference between means,
one may elect to perform one-tailed significance tests by accepting
only positive $\kappa_w$ and either halving the tabled P values in the one
df $\chi^2$ distribution, or for “exact” values, looking up the upper tail
area of the normal curve deviate $z = +\sqrt{\chi_w^2}$. Although the author
has argued against one-tailed tests in general (Cohen, 1965, pp. 106-
111), their protagonists would find them at least as defensible in
the context of $\kappa_w$ as they are in others in which they are advocated.


$\kappa_w$ and the m-way Table


The two-way contingency table is by far the most frequent in
behavioral research, and was therefore chosen for the example. It is
clear, however, that no problem is encountered in extending $\chi_w^2$ and
$\kappa_w$ to m-way tables for other values of m. For tables of any
dimensionality, all that is required is that each cell have an a priori u,
and, following data collection, an observed proportion and a chance-
expected proportion. Just as in two-way tables a cell’s chance-
expected proportion is $p_{i.}p_{.j}$, the product of its observed row and
column marginal proportions (as in Table 1), in a three-way table
a cell’s chance-expected proportion is the triple product of its
observed row, column and layer marginal proportions, etc.
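
For a three-way table, the chance-expected proportions described above can be sketched as follows. The function name `chance_expected` and the 2 x 2 x 2 table are hypothetical; the code simply multiplies the three one-way marginal proportions for each cell.

```python
from itertools import product


def chance_expected(p):
    """Chance-expected proportions for a three-way table p[i][j][k] of
    observed proportions: the product of the three one-way marginals."""
    n_r, n_c, n_l = len(p), len(p[0]), len(p[0][0])
    row = [sum(p[i][j][k] for j in range(n_c) for k in range(n_l))
           for i in range(n_r)]
    col = [sum(p[i][j][k] for i in range(n_r) for k in range(n_l))
           for j in range(n_c)]
    lay = [sum(p[i][j][k] for i in range(n_r) for j in range(n_c))
           for k in range(n_l)]
    return [[[row[i] * col[j] * lay[k] for k in range(n_l)]
             for j in range(n_c)] for i in range(n_r)]


# Hypothetical 2 x 2 x 2 table of observed proportions.
p = [[[0.10, 0.15], [0.05, 0.20]],
     [[0.20, 0.05], [0.15, 0.10]]]
e = chance_expected(p)
total = sum(e[i][j][k] for i, j, k in product(range(2), repeat=3))
print(e[0][0][0], total)
```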

Higher-order tables suggest further analytical refinements. A two-
way table may frequently be thought of as having one independent
variable (e.g., diagnosis) and one dependent variable (e.g.,
employment history). An m-way contingency table can, similarly, be
thought of as having m - 1 independent variables and one dependent
variable. Hypothesis weights can be written for each independent
variable, analogous to anova “main effects.” These weights can be
combined additively for the several independent variables, or provision
can be made for two-way and higher-order interactive effects
by the further addition of interaction hypothesis weight components,
in obvious analogy with the anova factorial design model. Thus, a
system similar to that of information or uncertainty analysis (Garner
and McGill, 1956) or $\chi^2$ partitioning (Bresnahan and Shapiro,
1966), but with hypothesis weights which increase power, can be
generated using the $\chi_w^2$ framework.

⁴ The weights which maximize $\kappa_w$ turn out to be of no psychological
interest. $\kappa_w$ is a maximum for binary weights assigned so that the cell where
$p_{ij}/p_{i.}p_{.j}$ is smallest is assigned 0 and all the other cells 1. The author is
grateful to Joseph L. Fleiss of Columbia University for supplying the proof.

One-way tables can also be analyzed by $\chi_w^2$. Whereas in contingency
tables, the chance-expected proportions are generated by multiplication
of marginal observed probabilities, in the one-way table
the null hypothesis is expressed by an a priori set of chance-
expected proportions dictated by the logic of the analysis. Thus, for
example, to determine whether four alternatives are equally preferred,
the null hypothesis is one of equiprobability, so that each
cell’s chance-expected proportion is .25. The investigator can then
express his (alternate) hypothesis in the form of a weight in each of
the four cells, and proceed to determine the observed proportions.
Then $\chi_w^2$ and $\kappa_w$ can be found and interpreted as above.


Statistical Power 


The advantage of $\chi_w^2$ lies in those instances where an investigator’s
anticipations about the outcome of a research are substantially
more right than wrong. In such instances, with the incorporation of
the u weights, the probability that the null hypothesis will be rejected
for any given significance criterion by $\chi_w^2$ is greater than
when the conventional $\chi^2$ procedure is used. Thus, one would normally
expect the $\chi_w^2$ test to have greater statistical power (Cohen,
1965, pp. 95-101; 1969, Chap. 7). The extent of the power advantage
of $\chi_w^2$ increases with the number of df for the $\chi^2$ test and also
with the degree of correlation of the hypothesis weights with the
maximizing weights of Formulas 9 and 10. (As noted, this advantage
in power is completely lost when the $\chi^2$ test has one df.)

When the converse case occurs, i.e., when the conventional $\chi^2$ test
yields greater statistical significance than $\chi_w^2$, the appropriate inference
is that the investigator’s composite hypothesis, as expressed
in the weights, is not more right than wrong, or not sufficiently so.
Since $\chi^2$ is large, an association obviously exists, but not of the kind
anticipated, so that a clear signal is provided that the substantive
issues require rethinking.


A LINEAR-PREDICTION APPROACH TO
DEVELOPING TEST NORMS BASED ON
MATRIX-SAMPLING¹


DAVID J. KLEINKE 
Syracuse University 


THE efficacy of using matrix-sampling for approximating the
norming sample’s total-test mean and variance has been demonstrated
in a number of studies (Cook and Stufflebeam, 1967; Lord,
1962; Owens and Stufflebeam, 1969; and Plumlee, 1964). (The term
“matrix-sampling” has been adopted here, rather than “item-sampling,”
because the sampling is indeed across examinees as well
as across items.) Lord (1962) established the basic methodology,
including the equations for using the part-test, or item-sample,
statistics to estimate the total-test mean and variance. Except for
Owens and Stufflebeam, all of the investigators have used essentially
the same methodology. Examinees who had completed a full test
were assigned post hoc to groups which were presumed to have responded
to only a sample of the items on the test. In the Owens and
Stufflebeam study, item-samples were selected before testing; these
items appeared first in specially prepared booklets. Then, in each of
the four studies mentioned above, the estimated mean and variance
for the total test were compared with the corresponding values actually
obtained by these same examinees.

However, in only two of these studies (Cook and Stufflebeam,
1967; Lord, 1962) were the total-test distributions also approximated.
This was despite the fact that the technique of matrix-sampling
is designed for test norming, and the value of having a
basis for approximating percentiles in such a situation is obvious.
In the two studies in which distributions were approximated, the
negative hypergeometric distribution (H) was used to generate
these approximations. With this discontinuous distribution, a “curve”
may be fitted to either observed or estimated total-test data given
only the mean, variance, and number of items (Keats and Lord,
1962; Lord, 1960). In the current study, an alternate method is presented,
one using a linear-prediction (LP) approach for generating
the estimated total-test distribution.

¹ Based in part on an Ed.D. dissertation in educational psychology and
statistics completed at the State University of New York at Albany under
the direction of Robert F. McMorris and Robert M. Pruzek, co-chairmen,
and James E. Powers. An earlier version of this paper was presented at the
joint meeting of AERA and NCME, Minneapolis, March of 1970.

This LP approach may be described briefly. Each examinee is
presented with a sample of items from the total-test. On the basis
of his performance on that item-sample, his score on the remainder
of the total-test, the composite of items with which he was not presented,
is predicted. The sum of this predicted score and his obtained
item-sample score is his predicted total-test score. The distribution
of the predicted total-test scores of all the examinees is
the LP distribution.


Theoretical Basis 


First, assume that the total-test, containing mn items, is randomly
divided into m nonoverlapping item-samples each containing n
items. Similarly, the norming sample of mN persons is randomly divided
into m nonoverlapping examinee-samples of N persons each.
The item-sample to which person i responds and the examinee-
sample of which he is a member are designated as x. An examinee’s
score is the number of items he answers correctly. Items are scored
dichotomously and are non-differentially weighted. Now, the predicted
score of examinee i on the composite y of the n(m - 1)
items to which he does not respond is

$$
Y_i' = k_{y \cdot x}(X_i - \bar{X}) + \hat{\bar{Y}}, \qquad (1)
$$

where $k_{y \cdot x}$ is a constant, $X_i$ is i's score on x, $\bar{X}$ is the mean of the
scores obtained by members of examinee-sample x on item-sample x,
and $\hat{\bar{Y}}$ is the estimated mean on y for examinees in x. This estimate
is the sum of the means on the (m - 1) item-samples not x obtained
by the members of the (m - 1) examinee-samples not x. (Note that
“prediction” and “estimate” are used in different senses and are
denoted differently.)


Regression Prediction


In (1), $k_{y \cdot x}$ could be replaced by $b_{y \cdot x}$, which is equal to
$\hat{r}_{xy}(\hat{s}_y/s_x)$, an estimated linear regression coefficient. (A derivation
of $b_{y \cdot x}$ is presented in Kleinke [1969, pp. 30-32].) Then the standard
deviation of the predicted y-scores would be

$$
s_{y'} = \hat{r}_{xy}\,\hat{s}_y. \qquad (2)
$$

Using the equation for the variance of a composite of two parts,
the estimated total-test variance for examinee-sample x is

$$
\hat{s}_t^2 = s_x^2 + \hat{s}_y^2 + 2\hat{r}_{xy}\,s_x\,\hat{s}_y, \qquad (3)
$$

while the variance of the predicted total-test scores obtained through
regression-prediction is

$$
s_{t'}^2 = s_x^2 + s_{y'}^2 + 2 r_{xy'}\,s_x\,s_{y'}. \qquad (4)
$$

Since the values of $Y_i'$ are linear functions of $(X_i - \bar{X})$, for any
linear-prediction method, $r_{xy'} = 1$. Taking this, together with (2),
(4) may be rewritten as

$$
s_{t'}^2 = s_x^2 + \hat{r}_{xy}^2\,\hat{s}_y^2 + 2\hat{r}_{xy}\,s_x\,\hat{s}_y. \qquad (5)
$$

When the variance of the predicted total-test scores is subtracted
from the estimated variance of the total-test scores, that is, when (5)
is subtracted from (3),

$$
\hat{s}_t^2 - s_{t'}^2 = \hat{s}_y^2\,(1 - \hat{r}_{xy}^2). \qquad (6)
$$
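
A small numerical check may help in following Equations (2) through (6). The values below are hypothetical (they are not data from the study), and the variable names are illustrative; the sketch computes the estimated total-test variance, the variance of the regression-predicted totals, and the shortfall given by Equation (6).

```python
# Hypothetical values for one examinee-sample (not data from the study).
s_x = 3.0      # observed SD on the item-sample taken
s_y = 20.0     # estimated SD on the composite of items not taken
r_xy = 0.8     # estimated correlation between x and y

est_total_var = s_x**2 + s_y**2 + 2 * r_xy * s_x * s_y        # Equation (3)
s_y_pred = r_xy * s_y                                          # Equation (2)
# Equations (4)-(5): the correlation of x with predicted y is 1.
pred_total_var = s_x**2 + s_y_pred**2 + 2 * s_x * s_y_pred
shortfall = s_y**2 * (1 - r_xy**2)                             # Equation (6)

print(est_total_var, pred_total_var,
      est_total_var - pred_total_var, shortfall)
```

With any correlation below unity, the regression-predicted totals are less variable than the estimated total-test variance by the amount in the last column.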


z-sum-Prediction


An alternative approach for obtaining a value for $k_{y \cdot x}$ in (1) is
to assume that an examinee will obtain the same standard score
on the composite of items-not-taken as on the item-sample-taken, or

$$
\frac{Y_i' - \hat{\bar{Y}}}{\hat{s}_y} = \frac{X_i - \bar{X}}{s_x}. \qquad (7)
$$

This method would appear to be especially appropriate in the
case where there are only two item-samples. Then, $\hat{s}_y$ would be that
standard deviation observed when those persons not in examinee-
sample x respond to those questions not in part-test x. When (7)
is solved for $Y_i'$, it yields

$$
Y_i' = \frac{\hat{s}_y}{s_x}(X_i - \bar{X}) + \hat{\bar{Y}}. \qquad (8)
$$

Equation (8) is the linear regression equation for the special case
when $\hat{r}_{xy}$ equals unity. This is not assumed here, but the implications
of having $\hat{r}_{xy} = 1$ are explored, because this would lead to having
the variance of the predicted total-test scores equal

$$
s_{t'}^2 = s_x^2 + \hat{s}_y^2 + 2 s_x \hat{s}_y. \qquad (9)
$$

When (3), the estimated variance of the total-test scores, is
subtracted from (9), the variance of the total-test scores predicted
using z-sum-prediction, the difference is

$$
s_{t'}^2 - \hat{s}_t^2 = (1 - \hat{r}_{xy})(2 s_x \hat{s}_y). \qquad (10)
$$

This difference, for all positive values of $\hat{r}_{xy}$ less than unity, is a
measure of the over-approximation of the variance of the total-test
scores as computed from (7) in the same manner that (6) is a measure
of an under-approximation.
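
The over-approximation in Equation (10) can be illustrated with the same kind of check. The values are hypothetical and the variable names illustrative; the sketch compares the variance implied by z-sum-prediction, Equation (9), with the estimated total-test variance of Equation (3).

```python
# Hypothetical values (not data from the study).
s_x, s_y, r_xy = 3.0, 20.0, 0.8

est_total_var = s_x**2 + s_y**2 + 2 * r_xy * s_x * s_y    # Equation (3)
zsum_total_var = s_x**2 + s_y**2 + 2 * s_x * s_y          # Equation (9)
excess = (1 - r_xy) * (2 * s_x * s_y)                     # Equation (10)

print(est_total_var, zsum_total_var,
      zsum_total_var - est_total_var, excess)
```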


c-Prediction

A coefficient, here designated $c_{y \cdot x}$, may be derived such that when
it is substituted for $k_{y \cdot x}$ in (1) and the resulting Y' values are added
to the observed X values, the variance of the predicted total-test
scores equals the estimated variance of the total-test scores. First,
these variances are set equal, or

$$
s_{t'}^2 = \hat{s}_t^2. \qquad (11)
$$

Since $\hat{s}_t^2$ is the mean of m estimates (Lord, 1962), (11) may be
rewritten as

$$
s_x^2 + c_{y \cdot x}^2 s_x^2 + 2 c_{y \cdot x} s_x^2 = s_x^2 + \hat{s}_y^2 + 2\hat{r}_{xy}\,s_x\,\hat{s}_y, \qquad (12)
$$

which forms the quadratic with solution

$$
c_{y \cdot x} = \frac{-s_x \pm \sqrt{s_x^2 + \hat{s}_y^2 + 2\hat{r}_{xy}\,s_x\,\hat{s}_y}}{s_x}. \qquad (13)
$$

Since the expression under the radical is $\hat{s}_t^2$,

$$
c_{y \cdot x} = \frac{\hat{s}_t}{s_x} - 1. \qquad (14)
$$

Equation (14) may also be derived by assuming that an examinee’s
standard score on the total test will equal his item-sample standard
score.
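
The defining property of c-prediction, that the predicted total-test scores have exactly the estimated total-test variance, can be sketched in a few lines. The scores and estimates below are hypothetical, the function name `c_predicted_totals` is illustrative, and the use of variances with divisor N is an assumption of the sketch.

```python
import statistics


def c_predicted_totals(x_scores, y_mean_est, total_var_est):
    """Predicted total-test scores using c = s_t / s_x - 1 in Equation (1)."""
    x_bar = statistics.fmean(x_scores)
    s_x = statistics.pstdev(x_scores)              # SD with divisor N (assumed)
    c = total_var_est ** 0.5 / s_x - 1             # Equation (14)
    # Predicted total = observed X plus predicted Y' from Equation (1).
    return [x + c * (x - x_bar) + y_mean_est for x in x_scores]


# Hypothetical item-sample scores and estimates (not data from the study).
x_scores = [2, 4, 5, 5, 6, 7, 8, 9]
totals = c_predicted_totals(x_scores, y_mean_est=45.0, total_var_est=360.0)
print(statistics.fmean(totals), statistics.pvariance(totals))
```

In the study itself the predicted totals were subsequently rounded to integers, a step this sketch leaves out.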




Empirical Verification

The methodology of Lord (1962), which was followed by Plumlee
(1964) and Cook and Stufflebeam (1967), was employed to obtain
an empirical verification of the application of LP following matrix-
sampling.


Method 


The examination used for this verification was a 100-item, five- 
option multiple-choice test of verbal ability that was a portion of a 
larger test that had been taken by about 167,000 12th grade pupils 
with college aspirations. Ten 10-item samples were selected randomly 
without replacement. Ten 105-examinee samples were selected using 
spaced random-sampling methods. The first answer sheet selected 
was placed in group 1, the next in group 2, and so forth. Total-test 
and item-sample scores were obtained through the use of an optical 
scanning machine and a CDC 3100 computer. This same computer 
was used to generate the item sample-estimated mean and variance 
and to obtain the distribution of c-predicted total-test scores, the 
LP distribution. The computational procedures for estimating the 
mean and variance were based on those presented in Lord (1962). 
To obtain the LP distribution, (14) was substituted for $k_{y \cdot x}$ in (1)
to predict Y' scores for each examinee. The sum of each examinee's
X and Y' values, his predicted total-test score, was then rounded
to the nearest integer. The distribution of the 1,050 examinees’
predicted total-test scores was the LP distribution. The H-generated
distribution was obtained using an IBM 360, model 50, following 
the computational procedures of Lord (1960). The computer program 
for this is described in Kleinke (1970). 


Results 


The total-test mean actually obtained by the 1,050 examinees was 
51.03; the variance was 357.96. The estimated values, produced fol- 
lowing item-sampling and the application of the algebraic equiva- 
lents of Lord’s equations, were 50.74 and 359.51, respectively. The 
mean and variance of the LP-approximated distribution were 50.74 
and 359.74. The corresponding values for the H-generated distribu- 
tion were 50.74 and 359.51. 

The criterion distribution (that of the total-test scores actually
obtained by the 1,050 examinees), the linear-predicted (LP) distribution,
and the H-generated distribution are shown in Figure 1.
It can readily be seen that the LP distribution is quite jagged. This
is due to errors of rounding. Note, for instance, that 22 of the 81
raw scores in the 10-90 range have no examinees predicted to obtain
them, while some adjacent raw scores are obviously overly frequently
predicted. This results in a chi-square value for goodness-
of-fit of 861.00, which, with 61 degrees of freedom, is equivalent to a
t of 30.65. What is not as obvious on inspection of Figure 1 is that
the H distribution is systematically different from the criterion
distribution. The chi-square associated with it is 171.78, equivalent
to a t of 7.54.

This systematic inaccuracy is seen more readily on inspection of
Figure 2, in which the data underlying Figure 1 are plotted as cumulative
proportions. The H-distribution is above the criterion distribution
for scores 0-27, 31, and 63-100 and below it for the others.

Figure 1. Number of examinees at each raw score point: criterion-, LP-,
and H-generated distributions.

Figure 2. Cumulative proportions of examinees: criterion-, LP-, and
H-generated distributions.

While the two distributions have the same mean (50.74) and variance
(359.41), the median of the H-distribution is 50.85, indicating
very slight negative skewness, and that of the criterion distribution
is 49.76, indicating some positive skewness.

The largest discrepancy between an approximated cumulative
proportion and the criterion cumulative proportion was .035, for the
LP distribution at a raw score of 68. Generally, however, the LP
approximations were closer to the criterion values than were the H-
generated approximations. The mean discrepancies for the LP distribution
were .003 (algebraic) and .010 (absolute), whereas those
of the H-distribution were .004 and .013, respectively. The corresponding
means, when discrepancies are weighted for number of
examinees at each raw score point of the criterion distribution, are
.001 and .012 for the LP distribution and -.003 and .015 for the H-
distribution.
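
The two unweighted discrepancy summaries reported above can be sketched as follows. The frequencies are hypothetical (not data from the study), the function names are illustrative, and taking each discrepancy as approximated minus criterion cumulative proportion is an assumption about the sign convention.

```python
from itertools import accumulate


def cumulative_proportions(freqs):
    total = sum(freqs)
    return [c / total for c in accumulate(freqs)]


def mean_discrepancies(approx_freqs, criterion_freqs):
    """Mean algebraic and mean absolute difference between two cumulative
    proportion curves, taken over raw-score points."""
    a = cumulative_proportions(approx_freqs)
    c = cumulative_proportions(criterion_freqs)
    diffs = [ai - ci for ai, ci in zip(a, c)]   # approximated minus criterion
    algebraic = sum(diffs) / len(diffs)
    absolute = sum(abs(d) for d in diffs) / len(diffs)
    return algebraic, absolute


# Hypothetical frequencies at raw scores 0-4 (not data from the study).
criterion = [1, 3, 4, 1, 1]
approx = [2, 2, 3, 2, 1]
algebraic, absolute = mean_discrepancies(approx, criterion)
print(algebraic, absolute)
```

Positive and negative discrepancies cancel in the algebraic mean but not in the absolute mean, which is why the two summaries can differ considerably.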


Discussion 

First, although such demonstration was not the purpose of this 
study, the technique for estimating total-test mean and variance 
following post hoc item-sampling has again been shown to be appro- 
priate. This is consistent with the findings of previous investigations 
(Cook and Stufflebeam, 1967; Lord, 1962; and Plumlee, 1964). The 
need for further empirical work with item-sampling done before 
testing and examinees responding to only item-samples, that is, 
extension of Owens and Stufflebeam (1969), is obvious and need not 
be belabored here. 

The systematic inaccuracy of the H-generated distribution was
an incidental and perhaps important finding. Whether the difference
in skewness was due to the fact that the mean of the criterion
distribution was above the 50 per cent difficulty level (n/2) and its
median below (n/2) or whether it was due to the effect of chance in
this multiple-choice examination, is open to further investigation.
Because total score was number of items answered correctly, the
criterion distribution had a steep slope in the region around a raw
score of 20, the “chance-score.” A better fit for similar data was obtained
using Lord’s (1969) methodology with non-linear regression
of true score on observed score (Lord, personal communication).

Although the H-generated distribution had these systematic inaccuracies,
it was not greatly less accurate in approximating the
criterion distribution than was the LP distribution. This was because
of the essentially random discrepancies introduced by rounding
the predicted total-test scores to integers before forming the
cumulative distribution. Presumably, if this rounding were not
done, these discrepancies would be greatly reduced or even disappear
altogether. But to do that would be to ignore the fundamental discontinuity
of the obtained total-test scores.

Furthermore, neither the H-generated nor the LP distributions
were highly inaccurate. In neither case was there a discrepancy as
great as four percentile points. Greater and probably systematic
errors can occur in the norming of tests because of the composition
of the norming sample. It is precisely this difficulty that matrix-
sampling seeks to alleviate. The presumptions are that (a) shortening
the task for examinees reduces the likelihood of having institutions
exclude themselves from the norming sample and (b) such
institutions contain examinees whose mean score, overall variance,
or both differ systematically from those of examinees in participating
institutions.

It should here be noted that although linear prediction proved 
effective with these data, it is not claimed that the generalizability 
of the technique has been established. Further work, both empirical 
and theoretical, is indicated. As a first step, applying linear predic- 
tion with highly skewed distributions should be explored. 

At the same time, other potential uses of linear prediction using 
c_vs should be investigated. For instance, if an examinee completes
only part of a test battery and if a predicted overall score is needed,
could the use of c_vs be more defensible than the use of either
regression-prediction or what is here termed "z-sum-prediction"? This
has implications beyond test construction and score interpretation. 
The relationship between a child's IQ and his later adult IQ, for 
instance, may be thought of as a part-whole relationship, subject 
to the same sorts of prediction as approximating a total-test score
on the basis of part-test performance. 

Since the value of c_vs was the same for members of all 10 examinee
groups in the current study, no empirical comparison of the predicted
and the obtained scores can be reasonably made. The need
for an extension of this study, in which a unique value of c_vs is
employed for each examinee group, is obvious. Then the obtained 
and the predicted total test scores can be compared empirically. 
SEQUENTIAL TESTING FOR DICHOTOMOUS DECISIONS


ROBERT L. LINN AND DONALD A. ROCK
Educational Testing Service 
T. ANNE CLEARY 
University of Wisconsin 


INDIVIDUALIZATION of instruction is generally considered an im- 
portant goal by both professional educators and laymen. This is true 
despite the frequently obscure meaning of “individualization.” In 
the broadest sense individualization refers to any educational adap- 
tation to individual differences, but more specifically Cronbach 
(1967) has identified five basic types of adaptation ranging in 
instructional flexibility from altering the duration for instruction 
by dropping students from the educational system (or at least the 
main-stream academic curriculum) to tailoring the method of in- 
struction to the individual. 

Tailoring the method of instruction assumes the existence of in- 
teractions between methods of instruction and levels of individual 
traits. Cronbach and Snow (1969) have recently reviewed the 
results of a number of studies designed to investigate aptitude- 
treatment interactions (ATI). Although the results to date are 
equivocal, Cronbach and Snow argue quite persuasively for the 
potential of this approach and the need for well-conceptualized 
ATI research efforts. 

Another type of individualization, which has received consider- 
able attention in recent years, is probably best exemplified by the 
work at the University of Pittsburgh’s Learning Research and 
Development Center on Individually Prescribed Instruction (IPI). 
IPI is based upon “well defined sequences of progressive, behaviorally
defined objectives . . .” (Glaser, 1968, p. 5), and individualization
is achieved by allowing students to progress at their own rate.

Measurement of student aptitudes and accomplishment plays a
critical role in all types of individualization. From the point of
view of individualized instruction, “. . . testing and teaching are
inseparable aspects and not two different enterprises, as one might
be led to believe by current practices in education” (Glaser, 1968,
p. 10). To be really adaptive, individualized instruction must have
frequent assessment of performance which provides the diagnostic 
information necessary to prescribe the next instructional segment. 

A major function to be served by these assessments is cate- 
gorization of students, rather than the continuous measurement that 
has been traditionally emphasized. For example, if the method of 
instruction is to be tailored to the individual, students need to be 
categorized according to levels of abilities and other learning 
characteristics. Or, if students are allowed to proceed at their own 
rate, assessment must place them into categories of students who 
have or have not achieved specified objectives. Given this type of 
placement decision, it is clear that the measurement problem is 
easier for individuals who are substantially above or below a cut- 
ting point than for those people who are quite close to that cutting 
point; and thus a saving in testing time may be obtained by 
sequential strategies.

The purpose of this study was to investigate the potential ad- 
vantages of a sequential testing procedure for multidimensional 
dichotomous designs. Green (1970) has presented theoretical re- 
sults showing that a sequential testing procedure would result in 
a reduction of testing time by approximately 50 per cent of that
required for equal accuracy by a conventional testing procedure.
This result is based on the assumed use of items all with equal 
difficulty at a level appropriate for the decision, and all with equal 
discriminating power. The items were also assumed to be free of 
chance success due to guessing. 

The procedures for sequential decision making developed by 
the Columbia Statistical Research Group (1945) were used by 
considerable savings in testing time can be achieved. Cleary, Linn,
and Rock (1968a, 1968) have also used a generalized sequential 
procedure due to Armitage (1950) to route people into three and 
four groups ordered along a single continuum as part of a two- 
stage test. 

In the present study, existing item response data on three achieve- 
ment tests were used to compare the use of two sequential testing 
procedures with conventional testing for purposes of dichotomous 
decisions on each of the three dimensions. 


Procedure 


Item response data to three of the General Examinations of the 
College Board’s College Level Examination Program (CLEP) were 
available on magnetic tape for 4840 college students. The General 
Examinations of CLEP are intended to assess the examinees’ knowl- 
edge of basic facts and concepts in five areas of the liberal arts. 
The three General Examinations used in this study were English 
Composition, Mathematics and Natural Sciences. English Composi- 
tion and Natural Sciences are 100 items in length and Mathematics 
is 75 items. (For further information regarding the tests, see College 
Entrance Examination Board, 1967.) 

The sample was first divided into two random halves of 2420 
students. One sample was used for calibration purposes and the 
second sample was retained for purposes of cross-validation. 

The calibration sample frequency distributions of the three full-length
tests were obtained and used to determine for each test a
cutting score that divided the sample approximately in half. (The
resulting numbers of people in the lower group were 1261, 1268, and
1331 for English Composition, Natural Sciences, and Mathematics,
respectively.)

Two sequential testing procedures were investigated: in the first 
procedure each dimension was treated independently while in the 
second the decision that was made for Mathematics was taken into 
account in the decision procedure for English Composition and 
Natural Sciences. The sequential assignment procedure in both 
cases, however, used a technique developed by Armitage (1950) 
which is directly applicable to more than two groups. 

The calibration sample was used to compute item difficulties in 
the high and low groups on each dimension. Let $P_{Hi}$ be the
proportion of examinees in the high group on a given test who
answered item $i$ of that test correctly. Similarly, let $P_{Li}$ be the
proportion of examinees in the low group who answered item $i$
correctly. Given $P_{Hi}$ and $P_{Li}$, the sequential testing procedure was
as follows: (1) Items were scored in the order that they appeared
on the test; (2) after each item was scored, a decision was made
either to assign to one of the two groups or continue testing. A
person was assigned to the high group following the mth item if:

$$y(m) = \sum_{i=1}^{m} \log R_i > A, \qquad (1)$$

and he was assigned to the low group if

$$y(m) = \sum_{i=1}^{m} \log R_i < -A, \qquad (2)$$

where $R_i = P_{Hi}/P_{Li}$ if the response to item $i$ was correct,
$R_i = (1 - P_{Hi})/(1 - P_{Li})$ if the response to item $i$ was incorrect,
and $A$ is a constant.

If neither (1) nor (2) held, then another item was scored. This
process was continued until an examinee was either assigned to
a group or until 60 items had been scored. If no assignment had
been made after 60 items, then the examinee was assigned to the
high group if

$$y(60) > 0 \qquad (3)$$

and to the low group if

$$y(60) < 0. \qquad (4)$$
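
The decision rule in (1) through (4) can be sketched in a few lines of Python. This is an illustration only, not the authors' program: the function name, its arguments, and the treatment of a tie at y(60) = 0 (assigned here to the low group) are assumptions, since the text does not say how ties were resolved. Natural logarithms are assumed because the reported values of A (1.39, 2.30, 3.00, 4.61) are close to ln 4, ln 10, ln 20, and ln 100.

```python
import math

def sequential_assign(responses, p_high, p_low, A, max_items=60):
    """Return (group, items_used) for one examinee.

    responses: item scores (1 = correct, 0 = incorrect) in test order.
    p_high, p_low: calibration proportions correct for each item in the
    high and low groups (all strictly between 0 and 1).
    """
    y = 0.0
    used = 0
    for correct, ph, pl in zip(responses, p_high, p_low):
        if used == max_items:
            break
        # R_i = P_Hi / P_Li if correct, (1 - P_Hi) / (1 - P_Li) otherwise
        ratio = ph / pl if correct else (1.0 - ph) / (1.0 - pl)
        y += math.log(ratio)
        used += 1
        if y > A:        # inequality (1)
            return "high", used
        if y < -A:       # inequality (2)
            return "low", used
    # Forced decision after truncation, inequalities (3) and (4).
    # ASSUMPTION: a tie (y == 0) goes to the low group.
    return ("high" if y > 0 else "low"), used
```

With item proportions of .8 and .4 in the high and low groups, an examinee who answers every item correctly is assigned to the high group after three items at A = 1.39, because each correct response adds ln 2 to the running sum.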


The second sequential procedure was applied to
Natural Sciences and English Composition. It was identical to the
above procedure, except that it used the relationship between Math- 
ematics and the other two tests to compute a probability that an 
examinee was in the high or low group for Natural Sciences and 
for English Composition, given the group to which he was assigned 
on the Mathematics test. This information was then treated as 
the zero-th item and the sums in equations (1), (2), (3), and (4) 
started at i = 0.
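
One way to read the "zero-th item" is as a log ratio of the two conditional group probabilities, added to the sum before any item is scored. The sketch below follows that reading; it is an assumption about the form of the term, not a statement of the authors' computation, and the function name and the count arguments are hypothetical.

```python
import math

def zeroth_item_log_ratio(n_high, n_low):
    """log R_0 for examinees in one Mathematics group.

    n_high, n_low: calibration-sample counts of examinees from that
    Mathematics group who fell in the high and low groups on the other
    test (hypothetical inputs, both assumed positive).
    """
    p_high = n_high / (n_high + n_low)
    # ASSUMPTION: R_0 = P(high | Mathematics group) / P(low | Mathematics group)
    return math.log(p_high / (1.0 - p_high))
```

Under this reading the running sum y starts at log R_0 rather than at zero, so an examinee whose Mathematics assignment already points strongly toward one group needs fewer items to cross A or -A.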

Short conventional tests consisting of the first 5, 10, 15, 20, 25, 
30, 35, 40, 45, 50, 55, and 60 items were also scored for each test. 
Group assignment was made by identifying the score in the cali- 
bration sample that most nearly divided the sample in half. 

The sequential procedures and the short conventional tests were 
then compared using agreement with actual group assignment on 
the full tests in the cross-validation sample. 


Results 


The performance of the first sequential strategy can be seen in 
Figures 1, 2, and 3 which show the number of examinees in the 
cross-validation sample who were correctly assigned to the high or 
low group for Mathematics, Natural Sciences, and English Com- 
position respectively. The solid lines present the results for the 
short conventional tests. Observations were available at 5-item 
intervals and these points were connected by straight lines. The
results for the sequential tests are given for the four levels of risk
that were employed (A = 1.39, 2.30, 3.00, and 4.61), and again these
points are connected by straight lines. For the sequential tests, the
abscissa corresponds to the average number of items used for
assignment, since the number of items varies from one examinee to
another.

The dashed lines in Figures 1, 2, and 3 indicate approximately 
the number of items that would be required using the short con- 
ventional tests to equal the number of correct assignments achieved 
by each of the sequential tests. For Mathematics (Figure 1), the 
conventional test requires at least twice as many items as the 
average number of items required by the sequential test for the
same degree of accuracy. For the Natural Sciences and English 
Composition the ratio of number of items required by the con- 
ventional tests to the average number of items required by the 
Figure 2. Number of examinees in cross-validation sample correctly
assigned for Natural Sciences by short conventional and sequential tests.


items, and presumably in testing time, of 50 per cent would be of 
great value particularly in situations where dichotomous decisions 
need to be made fairly frequently. 
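
The equal-accuracy comparison can be sketched as a linear interpolation along the conventional-test curve. The function name and the data in the usage note are hypothetical and serve only to show the arithmetic; they are not results from this study.

```python
def items_for_accuracy(lengths, correct_counts, target):
    """Conventional test length that would yield `target` correct assignments.

    lengths: short-test lengths in increasing order (e.g., 5, 10, 15, ...).
    correct_counts: number of correct assignments at each length,
    assumed nondecreasing. Uses linear interpolation between points.
    """
    points = list(zip(lengths, correct_counts))
    for (n0, c0), (n1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1:
            if c1 == c0:
                return float(n0)
            return n0 + (n1 - n0) * (target - c0) / (c1 - c0)
    raise ValueError("target outside the tabulated range")
```

For example, with hypothetical counts of 1700, 1850, 1950, and 2000 correct assignments at 5, 10, 15, and 20 items, a sequential test that achieved 1900 correct assignments is matched by a conventional test of 12.5 items. If that sequential test used 6.25 items on average, the ratio would be 2.0.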

Table 2 lists the ratios of the number of items required for equal 
accuracy by the short conventional tests to the average number 
Figure 3. Number of examinees in cross-validation sample correctly
assigned for English Composition by short conventional and sequential tests.


1.39 and A = 4.61 from the corresponding ratios listed in Table 1.
For A = 2.30 and A = 3.00, the Natural Sciences ratios are
TABLE 1
Ratio of Number of Items Required by Conventional Tests to Average
Number Required by Sequential Tests for Equal Accuracy^a

Level of A for                      Achievement Area
Sequential Test    Mathematics    Natural Sciences    English Composition
     1.39              2.17             2.67                 2.08
     2.30              2.35             1.97                 1.95
     3.00              2.69             1.79                 1.84^b
     4.61              2.33             1.65                 1.71

^a Based on linear interpolation.
^b Based on a linear interpolation between results for 60 items and perfect accuracy at 100 items.


approximately .2 higher for the second type of sequential test
than the first type. The assignment to a group was made earlier
for the second type of sequential test than the first, but the ac- 
curacy was approximately equivalent or only very slightly better. 

The ratios in Table 2 for English are lower than their counter- 
parts in Table 1 for all four levels of A. The difference between the 
ratios is largest at A = 1.39 (the only instance where the ratio
is less than 1.0) and the difference decreases as A increases. Thus,
treating the assignment on Mathematics as the zero-th item re- 
sulted in a decrease in accuracy for English whereas the result is 
little change or a slight increase in accuracy for Natural Sciences. 

The efficacy of the second sequential procedure as compared to 
the first clearly depends very heavily on the magnitude of the 
relationship between the first and second set of items. The Mathe- 
matics test correlated .43 with the English test and .59 with the 


TABLE 2
Ratio of Number of Items Required by Conventional Tests to Average Number
Required by Sequential Tests with Equal Accuracy Where Mathematics
Assignment Is Treated as Zero-th Item^a

Level of A for             Achievement Area
Sequential Test    Natural Sciences    English Composition
     1.39               2.02                  .90^b
     2.30               2.22                 1.52
     3.00               2.03                 1.51
     4.61               1.61                 1.58

^a Based on linear interpolation.
^b Based on linear interpolation between results for 5 items and chance accuracy at 0 items.




Natural Sciences test in the calibration sample. The amount of 
information provided by a variable with a correlation of .43 roughly 
corresponds to the amount of information provided by one good 
item. Thus, some slight advantage would be expected even for 
English if the actual Mathematics test score was available. In 
this study, however, only the assignment information on the Math- 
ematics test was assumed to be known. This dichotomous informa- 
tion, containing errors of assignment, proved to be of no value for 
English. In general, the second sequential approach shows little 
promise except for cases where the correlation between the tests is 
quite high. 


Summary and Conclusions 


Two sequential testing strategies were compared with conven- 
tional testing procedures with respect to efficiency in making dichot- 
omous decisions. It was found that the short conventional tests 
generally required approximately twice as many items to achieve 
the level of accuracy obtained by the sequential tests. These em- 
pirical results were in close agreement with earlier theoretical re- 
sults. Thus, it is concluded that sequential testing procedures should 
prove of considerable value in making multiple dichotomous 
decisions when testing time is limited. The second type of sequential 
test, which used knowledge of assignment on Mathematics in mak- 
ing decisions on the other two dimensions, had some advantages 
over the first type of sequential test for one dimension, but not 
for the other. The usefulness of the second type of sequential test
depends heavily on the relationship between the two dimensions
in question. 
POLYNOMIAL APPROXIMATIONS TO THE
CUMULATIVE FISHER DISTRIBUTION 


DAVID R. OWEN 
University of Missouri at Columbia¹


A special case of the Beta distribution named by Snedecor in
honor of R. A. Fisher is of considerable importance in statistical 
inference. This two-parameter family of curves involves one ran- 
dom variable distributed as 


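$$g(x; m, n) = \frac{\Gamma\left(\frac{m+n}{2}\right) m^{m/2}\, n^{n/2}\, x^{(m-2)/2}}{\Gamma\left(\frac{m}{2}\right) \Gamma\left(\frac{n}{2}\right) (n + mx)^{(m+n)/2}}. \qquad (1)$$

The cumulative distribution refers to the area under the curve
from zero to F, or

$$G(F) = \int_{0}^{F} g(x; m, n)\, dx, \qquad (2)$$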


and this definite integral becomes of primary practical concern. 
Although (2) may be integrated by parts and evaluated by sum- 
ming a series of terms, the task grows undesirably tedious with a 
calculator or desk computer. Consequently, upper percentage points 
have been tabled (Merrington and Thompson, 1943), and the 
experimenter must be content knowing only the range in which 
an observed F ratio falls. In an attempt to provide other percentage
points, Paulson (1942) suggested a transformation which


1 Requests for reprints should be addressed to the author at Educational 
Testing Service. 
distributes nearly normally as the degrees of freedom become large.
Unfortunately, that statistic encounters a relative error of several 
percent for the smaller degrees of freedom. 
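
For readers with a modern computer at hand, the integral (2) can also be checked by direct numerical integration. The sketch below uses the standard density of the F distribution and a composite Simpson rule; it is an independent check rather than the method of this paper, the function names are hypothetical, and it assumes m >= 2 so that the density is finite at zero.

```python
import math

def f_density(x, m, n):
    """Density of the F distribution with m and n degrees of freedom (m >= 2)."""
    log_c = (math.lgamma((m + n) / 2) - math.lgamma(m / 2) - math.lgamma(n / 2)
             + (m / 2) * math.log(m) + (n / 2) * math.log(n))
    # Constant times (n + m x)^(-(m+n)/2), then times x^((m-2)/2)
    return math.exp(log_c - ((m + n) / 2) * math.log(n + m * x)) * x ** ((m - 2) / 2)

def f_cdf(F, m, n, steps=20000):
    """Composite Simpson approximation to G(F); `steps` must be even."""
    h = F / steps
    total = f_density(0.0, m, n) + f_density(F, m, n)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * f_density(k * h, m, n)
    return total * h / 3
```

For m = 2 the integral has the closed form 1 - (n/(n + 2F))^(n/2), which gives a convenient check on the routine. For F = 2.5 with 3 and 40 df the routine returns a value near .9268, in line with the numerical example at the end of the paper.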


Rational Approximations to G(F) 


Table 1 contains eighth degree polynomials which behave “very 
much” like the actual cumulative Fisher distribution (2) for 
selected m and n. Thus for given df, the appropriate polynomial 
may be evaluated with comparative ease, giving a more precise 
cumulative probability for an observed F ratio. These approxima- 
tions were constructed after the suggestions of Hastings, Hay- 
ward, and Wong (1955). 

Determining the cumulative probability of a given F requires
(a) selection from Table 1 of the appropriate column based on the
df, (b) verification that F is between $F_{min}$ and $F_{max}$, (c) defining a
variable U by evaluating

$$U = \frac{a_0 + a_1 F}{1 + b_1 F}, \qquad (3)$$

where U will range from zero to one, and (d) evaluating the
polynomial,

$$G^*(U) = c_0 + c_1 U + c_2 U^2 + c_3 U^3 + c_4 U^4 + c_5 U^5 + c_6 U^6 + c_7 U^7 + c_8 U^8,$$

in U. A numerical example is given at the conclusion of the paper.
Horner’s rule reduces the number of arithmetic operations from
44 to only 16, so the polynomial should be evaluated (working
outward from the innermost parentheses) as

$$G^*(U) = ((((((((c_8)U + c_7)U + c_6)U + c_5)U + c_4)U + c_3)U + c_2)U + c_1)U + c_0. \qquad (4)$$

With the coefficients known, an analysis of roundoff error leads
to a bound of $5 \times 10^{-7}$, substantially less than the error in the
approximation, when seven digits are retained to the right of the
decimal.
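
The nested evaluation in (4) translates directly into a short loop. The sketch below is illustrative; the function names are hypothetical, and the two counting helpers simply reproduce the operation counts quoted above (44 for term-by-term evaluation of an eighth degree polynomial, 16 for Horner's rule).

```python
def horner(coeffs, u):
    """Evaluate c0 + c1*u + ... + ck*u**k, with coeffs = [c0, c1, ..., ck]."""
    result = coeffs[-1]                  # start from the innermost coefficient
    for c in reversed(coeffs[:-1]):
        result = result * u + c          # one multiplication and one addition
    return result

def naive_operation_count(degree):
    # degree additions plus 1 + 2 + ... + degree multiplications
    return degree + degree * (degree + 1) // 2

def horner_operation_count(degree):
    # one multiplication and one addition per coefficient below the leading one
    return 2 * degree
```

Passing the nine coefficients of a Table 1 column as `[c0, ..., c8]` together with U evaluates G*(U) in the 16 operations noted in the text.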


Error Curves for G*(U) 


Subtracting the definite integral (2) from the approximation 
(4) for every point in the range gives rise to error curves having 
equal absolute valued extremals of alternating sign (levelled within
1.6 per cent of the smallest extremal) which, as suggested by Hast- 
ings et al. (1955), is an attribute of “best fitting” rational approxi- 
mations. Thus, it may be possible to construct other eighth degree 
polynomials which, over the same range, would come as close to 
(2), but none should have a smaller error bound. Error curves for 
the approximations in Table 1 may be sketched by noting additionally
that all begin with a positive extremal at U = 0, end with
a negative extremal at U = 1, and have nine roots within the following
ranges on U: .008 < $r_1$ < .009; .070 < $r_2$ < .079; .185 < $r_3$
< .208; .335 < $r_4$ < .377; .501 < $r_5$ < .560; .660 < $r_6$ < .730;
.797 < $r_7$ < .866; .902 < $r_8$ < .954; and .973 < $r_9$ < .994.


A Numerical Example


As an example of the use of these rational approximations, consider
an observed F ratio of 2.5 with 3 df for the numerator and 40
for the denominator. The cumulative probability is found by
(a) selecting the column headed 3, 40 in Table 1, (b) checking that
F = 2.5 is between $F_{min}$ = 0.80228 and $F_{max}$ = 9.5, (c) substituting
values for $a_0$, $a_1$, and $b_1$, together with the observed F into
(3), obtaining

$$U = \frac{11.2898657 + (-14.0722262)(2.5)}{1.0 + (-12.9890825)(2.5)},$$

or U = 0.7590926. Then (d) substituting the values for the $c_i$
together with U into (4) gives

G*(U) = ((((((11.3272904)0.7590926 + (-39.3229486))0.7590926
+ 54.4789835)0.7590926 + (-39.3905709))0.7590926
+ 15.7870610)0.7590926 + (-3.3164402))0.7590926

curve root locations, $r_6$ and $r_7$, where the error is positive. Therefore
we can be certain that the exact cumulative probability lies between
G*(U) - (+0.00010) and the calculated G*(U), or between
0.92674 and 0.92684.
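
The first step of the example, the transformation of F to U, is easily verified. The sketch below uses the three constants quoted in the example; the function name and the argument names a0, a1, and b1 are assumptions, since the symbols are partly illegible in the available text. The reported value G*(U) = 0.92684 is taken from the text, not recomputed, because the full set of polynomial coefficients is not reproduced here.

```python
def to_unit_interval(F, a0, a1, b1):
    """Map an observed F ratio to U = (a0 + a1*F) / (1 + b1*F)."""
    return (a0 + a1 * F) / (1.0 + b1 * F)

# Constants quoted in the example for 3 and 40 degrees of freedom
U = to_unit_interval(2.5, 11.2898657, -14.0722262, -12.9890825)

# Bounds on the exact probability, using the reported G*(U) and error 0.00010
g_star = 0.92684
lower_bound = g_star - 0.00010
```

The computed U agrees with the 0.7590926 reported above to seven decimal places, and the lower bound reproduces the 0.92674 given in the text.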
A NOTE ON THE EXPECTATION OF THE F-RATIO


GARY C. RAMSEYER 
Illinois State University 


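It is becoming increasingly common for textbooks in psychological
statistics to claim that in a simple ANOVA under a
true $H_0$ of equal treatment effects the expected value of F =
$MS_B/MS_W$ is unity (Myers, 1966; Winer, 1962). The writers in
presenting their argument first show that

$$E(MS_B) = \sigma_e^2 + n\sigma_\alpha^2, \qquad (1)$$

where

$\sigma_e^2$ = error variance

$\sigma_\alpha^2$ = variance of the treatment effects

$n$ = number of observations in each treatment group

and

$$E(MS_W) = \sigma_e^2. \qquad (2)$$

Then since

$$E\left(\frac{MS_B}{MS_W}\right) = \frac{E(MS_B)}{E(MS_W)}, \qquad (3)$$

it follows that

$$E\left(\frac{MS_B}{MS_W}\right) = \frac{\sigma_e^2 + n\sigma_\alpha^2}{\sigma_e^2}, \qquad (4)$$

and if $H_0$: $\sigma_\alpha^2 = 0$ is true,

$$E\left(\frac{MS_B}{MS_W}\right) = \frac{\sigma_e^2}{\sigma_e^2} = 1. \qquad (5)$$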


The fallacy in the above argument lies in step (3) which essentially
stipulates that the expected value of the ratio of two independent
random variables is equal to the ratio of the separate
expected values. It is the purpose of this paper to show that, in
general, the expected value of F = $MS_B/MS_W$ is greater than
unity under $H_0$.

Theorem: Let X be a positive random variable with finite mean.
Assume that X is not identically equal to a constant. Then

$$E\left(\frac{1}{X}\right) > \frac{1}{E(X)}.$$

Before proving this theorem a simple illustration of its truth
is offered. Consider the following discrete probability density
function

$$f(x) = \tfrac{1}{3}, \quad x = 1, 2, 3$$
$$\phantom{f(x)} = 0, \quad \text{elsewhere.}$$

Now

$$E(x) = \sum x f(x) = 1\left(\tfrac{1}{3}\right) + 2\left(\tfrac{1}{3}\right) + 3\left(\tfrac{1}{3}\right) = 2.$$

Therefore,

$$\frac{1}{E(x)} = \frac{1}{2}.$$

On the other hand,

$$E\left(\frac{1}{x}\right) = \sum \frac{1}{x} f(x) = 1\left(\tfrac{1}{3}\right) + \tfrac{1}{2}\left(\tfrac{1}{3}\right) + \tfrac{1}{3}\left(\tfrac{1}{3}\right) = \frac{11}{18}.$$

Hence since 11/18 > 1/2, E(1/x) > 1/E(x) for this example.
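
The arithmetic of this illustration can be confirmed with exact fractions. The sketch assumes the distribution f(x) = 1/3 for x = 1, 2, 3, which is the distribution that reproduces the stated values E(x) = 2 and E(1/x) = 11/18; the variable names are hypothetical.

```python
from fractions import Fraction

support = [1, 2, 3]        # ASSUMPTION: values with nonzero probability
prob = Fraction(1, 3)      # equal probability on each value

mean_x = sum(x * prob for x in support)                     # E(x)
mean_inverse = sum(Fraction(1, x) * prob for x in support)  # E(1/x)
gap = mean_inverse - 1 / mean_x                             # E(1/x) - 1/E(x)
```

The difference E(1/x) - 1/E(x) equals 1/9 here, a strictly positive quantity as the theorem requires.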
Proof of the Theorem 


This derivation is an expanded presentation of a proof which
depends only upon the convexity of the function y = 1/x (Fleiss).
Let $f(x) = 1/x$, where $x > 0$. Also let $\theta = E(x)$ and let $g(x)$ be
the equation of the straight line passing through the point $[\theta, f(\theta)]$
and having slope $f'(\theta)$, where $f'(\theta)$ is the first derivative of $f(x)$
with respect to $x$ evaluated at $x = \theta$. Now since $f'(x) = -1/x^2$,
$f'(\theta) = -1/\theta^2$. Then employing the slope-point form of the
equation of a straight line we have

$$g(x) - f(\theta) = -\frac{1}{\theta^2}(x - \theta).$$

Since $f(\theta) = 1/\theta$ the above simplifies to

$$g(x) = \frac{2}{\theta} - \frac{x}{\theta^2} = \frac{2\theta - x}{\theta^2}.$$

Next let $q(x) = (x - \theta)^2 = x^2 - 2x\theta + \theta^2$. The function $q(x)$ will
be zero if and only if $x = \theta$; otherwise it is positive. Hence,

$$x^2 - 2x\theta + \theta^2 \geq 0$$

or

$$\theta^2 \geq 2x\theta - x^2.$$

Dividing the above inequality through by $x\theta^2$ yields

$$\frac{1}{x} \geq \frac{2\theta - x}{\theta^2}$$

or $f(x) \geq g(x)$, where the equality holds if and only if $x = \theta$. Now
consider $h(x) = f(x) - g(x)$. Since $f(x) \geq g(x)$, $h(x) \geq 0$ and
thus $E[h(x)] \geq 0$. But

$$E[h(x)] = E\left(\frac{1}{x} - \frac{2}{\theta} + \frac{x}{\theta^2}\right)
= E\left(\frac{1}{x}\right) - \frac{2}{\theta} + \frac{E(x)}{\theta^2}
= E\left(\frac{1}{x}\right) - \frac{1}{\theta},$$

and hence

$$E\left(\frac{1}{x}\right) - \frac{1}{\theta} \geq 0$$

or equivalently

$$E\left(\frac{1}{x}\right) \geq \frac{1}{\theta} = \frac{1}{E(x)},$$

where equality holds if and only if $x = \theta = E(x)$. Since $x$ was
assumed not to be identically equal to a constant, the inequality
is strict. That is,

$$E\left(\frac{1}{X}\right) > \frac{1}{E(X)}.$$


Corollary to Theorem: Let Y be another positive random variable
independent of X and with finite mean. Then

$$E\left(\frac{Y}{X}\right) > \frac{E(Y)}{E(X)}.$$


Proof of the Corollary


From the theorem,

$$E\left(\frac{1}{X}\right) > \frac{1}{E(X)}.$$

Since Y is positive, E(Y) is positive. Thus multiplying both sides
of the above inequality by E(Y) yields

$$E(Y)\,E\left(\frac{1}{X}\right) > \frac{E(Y)}{E(X)}.$$

Now since Y and X are independent, Y and 1/X are independent
and we have

$$E\left(\frac{Y}{X}\right) = E(Y)\,E\left(\frac{1}{X}\right) > \frac{E(Y)}{E(X)}.$$

Clearly under a true $H_0$, $E(MS_B/MS_W)$ is an application of this
corollary since $MS_B$ is independent of $MS_W$ and both are positive
random variables. Thus

$$E\left(\frac{MS_B}{MS_W}\right) > 1.$$
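
The size of the excess over unity can be illustrated with a standard result that is not derived in this note: under the usual normal-theory assumptions the mean of the central F distribution is n/(n - 2), where n is the denominator degrees of freedom and n > 2. The sketch below simply tabulates that expression; the function name is hypothetical.

```python
from fractions import Fraction

def expected_f(df_within):
    """Mean of the central F distribution, df_within / (df_within - 2).

    ASSUMPTION: normal-theory ANOVA under a true null hypothesis; the mean
    does not exist when the denominator degrees of freedom are 2 or fewer.
    """
    if df_within <= 2:
        raise ValueError("the mean of F is undefined for df_within <= 2")
    return Fraction(df_within, df_within - 2)
```

With 40 degrees of freedom within groups the expected value is 20/19, or about 1.05, and it approaches unity only as the within-groups degrees of freedom grow large.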
CORRECTIONS FOR ATTENUATION¹


CHARLES E. WERTS AND ROBERT L. LINN
Educational Testing Service 


For many theoretical purposes an investigator may be interested 
in the correlation between two latent traits. His measures of these 
traits, however, will typically be fallible. The correlation between 
the two fallible measures will generally be less than the desired 
correlation between the two latent traits providing that the errors 
of measurement are uncorrelated. The standard procedure for cor- 
recting for this attenuation is to divide the correlation between the 
two measures by the square root of the product of their reliabilities. 
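
The standard correction just described amounts to a single line of arithmetic. The sketch below is illustrative; the function and argument names are hypothetical.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Observed correlation divided by the square root of the product
    of the two reliabilities."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

For instance, an observed correlation of .36 between measures with reliabilities of .64 and .81 gives a corrected correlation of .36/.72 = .50.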

The purpose of this paper is to discuss the application of Jöreskog’s
congeneric test model (1968) to the above problem. The use
of information obtained from part scores (e.g., split halves or
thirds) assuming only congeneric measures will be demonstrated. 
An application when certain errors of measurement are correlated 
will be given. Tests of the models will be indicated and finally a 
procedure for computing the whole test reliabilities making only 
congeneric assumptions about the parts will be discussed. Since our 
concern is with the models and logic of the procedures, the dis- 
cussion will be in terms of the population parameters except when 
estimation is mentioned explicitly. 


Basic Principles 
If test scores $X_1, X_2, \ldots, X_m$ are congeneric, then they can be
represented by $X_i = A_i + B_i X_T + E_i$ ($i = 1, 2, \ldots, m$) where $E_i$
is uncorrelated with $E_j$ for all $i \neq j$ and $X_T$ is a random variable

1 The research reported herein was performed pursuant to Grant No.
OEG-1-6-061830-0650 Project No. 6-1830 with the Office of Education,
U. S. Department of Health, Education and Welfare.
(see Jöreskog, 1968, equation 3). As Jöreskog indicates, the $B_i$ and
the error variances can be estimated using a factor analytic model
with one common factor. He goes on to demonstrate the usefulness
of his restricted maximum likelihood factor analysis for estimating
the parameters for several sets of congeneric test scores.

The implication of Jöreskog’s model in correcting for attenuation
can be seen from a study of the case in which the two tests are split
into halves and the correlation between these halves is used in the
Spearman-Brown formula to obtain the reliability of each test.
This case is depicted in Figure 1 where (a) $X_1$ and $X_2$, $Y_1$ and $Y_2$
are the split halves of tests X and Y, respectively; (b) $X_T$ and $Y_T$
represent the true scores underlying the tests; (c) $e_{X_1}$, $e_{X_2}$, $e_{Y_1}$, $e_{Y_2}$
are the errors of measurement which are assumed to be independent
(denoted by the lack of arrows connecting the e’s) of each other
and of both true scores $X_T$ and $Y_T$; and (d) the two arrows pointing
at each observed score indicate that the score is determined by the
underlying true score and the errors of measurement. The problem
is to find the correlation between the true scores, $R_{X_T Y_T}$. The
Spearman-Brown formulas assume that the split halves for a test are
completely equivalent, i.e., they have the same mean, variance, and


Figure 1. Model for split halves correction for attenuation. 


reliability. But what if in fact the halves are not equivalent but are
in fact congeneric? 

Examination of Figure 1 shows that four between test correlations
$R_{X_1Y_1}$, $R_{X_1Y_2}$, $R_{X_2Y_1}$, $R_{X_2Y_2}$ are available, whereas the usual approach
only employs the total correlation between tests. This additional
information may allow for computation of $R_{X_TY_T}$ making only
congeneric test assumptions about the two halves of a test. Congeneric
tests measure the same trait except for errors of measurement,
which means that the tests may have different reliabilities,
means, variances, and units of measurement. The reliability of each
test half is the square of the correlation of the observed score with
the true score, e.g., the reliability of split half $X_1$ equals
$(R_{X_1X_T})^2$. From true score theory six equations may be derived
from Figure 1, corresponding to the arrows connecting the respective
variables:

$$R_{X_1X_2} = R_{X_1X_T} R_{X_2X_T}, \qquad (1)$$

$$R_{Y_1Y_2} = R_{Y_1Y_T} R_{Y_2Y_T}, \qquad (2)$$

$$R_{X_1Y_1} = R_{X_1X_T} R_{X_TY_T} R_{Y_1Y_T}, \qquad (3)$$

$$R_{X_2Y_2} = R_{X_2X_T} R_{X_TY_T} R_{Y_2Y_T}, \qquad (4)$$

$$R_{X_1Y_2} = R_{X_1X_T} R_{X_TY_T} R_{Y_2Y_T}, \qquad (5)$$

and

$$R_{X_2Y_1} = R_{X_2X_T} R_{X_TY_T} R_{Y_1Y_T}. \qquad (6)$$

Equations (3) through (6) should be recognized as the traditional
correction for attenuation. The first thing to notice is that there are
six equations for only five unknown correlations. This excess of
information is useful in testing the model as follows. We can solve
for $R_{X_TY_T}$ in two different ways:

(a) from equations 1, 2, 3, and 4:

$$R^2_{X_TY_T} = \frac{R_{X_1Y_1} R_{X_2Y_2}}{R_{X_1X_2} R_{Y_1Y_2}}, \qquad (7)$$

and (b) from equations 1, 2, 5, and 6:

$$R^2_{X_TY_T} = \frac{R_{X_1Y_2} R_{X_2Y_1}}{R_{X_1X_2} R_{Y_1Y_2}}. \qquad (8)$$

If the model is consistent, equations (7) and (8) would yield
identical results. It follows that the model requires from equations
7 and 8 that:

$$R_{X_1Y_1} R_{X_2Y_2} = R_{X_1Y_2} R_{X_2Y_1}. \qquad (9)$$


But equation (9) is a Spearman “tetrad difference” for which there 
is a significance test (Holzinger, 1930). The implication of a signifi- 
cant difference is that one or more of the assumptions in the model 
is violated, e.g., the errors of measurement may be correlated. In 
this event application of any of the usual corrections for attenuation 
cannot be justified since the assumptions underlying such corrections 
are not satisfied. 
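
The two solutions for the squared true-score correlation and the tetrad condition can be sketched as follows. The function and argument names are hypothetical, and the pairing of the cross correlations into the two estimates (first halves with first halves and second with second, versus the two crossed pairs) is an assumption consistent with the model rather than a transcription of the original equations.

```python
def split_half_estimates(r_x1x2, r_y1y2, r_x1y1, r_x2y2, r_x1y2, r_x2y1):
    """Return two estimates of the squared true-score correlation and the
    tetrad difference that the model requires to be zero."""
    denominator = r_x1x2 * r_y1y2
    estimate_a = r_x1y1 * r_x2y2 / denominator   # from the matched pairs
    estimate_b = r_x1y2 * r_x2y1 / denominator   # from the crossed pairs
    tetrad = r_x1y1 * r_x2y2 - r_x1y2 * r_x2y1
    return estimate_a, estimate_b, tetrad

# Hypothetical population values: true-score correlations of the halves
# are .9, .7 (test X) and .8, .6 (test Y); the traits correlate .5.
est_a, est_b, tetrad = split_half_estimates(
    0.9 * 0.7, 0.8 * 0.6,              # within-test correlations
    0.9 * 0.5 * 0.8, 0.7 * 0.5 * 0.6,  # X1 with Y1, X2 with Y2
    0.9 * 0.5 * 0.6, 0.7 * 0.5 * 0.8,  # X1 with Y2, X2 with Y1
)
```

When the correlations are generated from the model, as in this hypothetical case, both estimates equal .25 (the square of .5) and the tetrad difference vanishes. With sample data the two estimates will differ, and the size of the tetrad difference indicates how well the assumptions hold.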

In summary, Jöreskog’s congeneric test model (1968) shows that
the usual correction for attenuation procedures based on split half 
reliabilities neglect useful information about the correlation of parts 
of one test with parts of the other test. Although our example con- 
sidered the case of split half reliability, the principle applies equally 
well to reliability coefficients derived from item data, in which case 
the correlations of items between tests are neglected. 


Model Building 
By way of generalizing the above principle, consider the case 
where the tests are split into three parts (congeneric tests may differ 
in number of items). Figure 2 shows the corresponding model. 
In essence, three equations may be obtained to compute the 


reliabilities of three congeneric measures (see Lord and Novick, 
1969, section 9.12). 


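$$R_{X_1X_2} = R_{X_1X_T} R_{X_2X_T}, \qquad (10)$$

$$R_{X_1X_3} = R_{X_1X_T} R_{X_3X_T}, \qquad (11)$$

$$R_{X_2X_3} = R_{X_2X_T} R_{X_3X_T}. \qquad (12)$$

Solution of these equations yields:
the reliability of

$$X_1 = R^2_{X_1X_T} = \frac{R_{X_1X_2} R_{X_1X_3}}{R_{X_2X_3}}, \qquad (13)$$

the reliability of

$$X_2 = R^2_{X_2X_T} = \frac{R_{X_1X_2} R_{X_2X_3}}{R_{X_1X_3}}, \qquad (14)$$

and the reliability of

$$X_3 = R^2_{X_3X_T} = \frac{R_{X_1X_3} R_{X_2X_3}}{R_{X_1X_2}}. \qquad (15)$$

Figure 2. Model for split thirds.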
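
The solution for the three part reliabilities can be sketched directly from the three intercorrelations. The function name and the numerical values in the usage lines are hypothetical.

```python
def part_reliabilities(r12, r13, r23):
    """Reliabilities of three congeneric parts from their intercorrelations.

    Each reliability is the product of the two correlations involving that
    part divided by the correlation between the other two parts.
    """
    return (r12 * r13 / r23, r12 * r23 / r13, r13 * r23 / r12)

# Hypothetical parts whose correlations with the true score are .9, .8, .7
rel_1, rel_2, rel_3 = part_reliabilities(0.9 * 0.8, 0.9 * 0.7, 0.8 * 0.7)
```

In this hypothetical case the recovered reliabilities are .81, .64, and .49, the squares of the assumed true-score correlations, which shows that no equivalence among the parts is needed.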

In a similar manner the reliabilities of the three parts of test Y
can be computed. When the equations are solved in this order, the
only unknown left in the system would be $R_{X_TY_T}$, but we have nine
cross test correlations. Since each cross test correlation in this model
is a function of the respective reliabilities and $R_{X_TY_T}$ (e.g., $R_{X_1Y_1}$ =
$R_{X_1X_T} R_{X_TY_T} R_{Y_1Y_T}$), nine estimates of $R_{X_TY_T}$ can be derived, i.e.,
eight different tetrad differences must be satisfied. Figure 2 corresponds
to a factor analytic model with two correlated nonoverlapping-group
factors (independent clusters). As Jöreskog (1968)
shows, his restricted maximum likelihood factor analysis may be
used to obtain estimates of the parameters and a large sample
likelihood ratio test of the goodness of fit of the model given
multivariate normal assumptions. One question to be faced is why bother
with split halves or thirds, why not use the item data directly?
However, in some tests easy items might represent memory or
computational skills and harder items reasoning powers so that the
assumption that all items are measuring the same underlying factor
is dubious. The researcher might still justify analysis at the item
level by arguing that his substantive theory calls for concern with 
whatever factor the items have in common. This is definitely not a 
statistical issue but a matter of construct validity. Even though 
not factorially pure, substantive interest may be in the scores com- 
bining easy and hard items, e.g., because this complex of skills has 
construct validity relative to the study of learning rates of a similarly 
complex task. 
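
The split-thirds solution in equations (13) through (15) can be sketched numerically. The following is a minimal illustration, not part of the original analysis: the loadings 0.9, 0.8, and 0.7 are hypothetical values chosen so that the recovered reliabilities can be checked by hand, and the function names are our own.

```python
import math


def split_thirds_reliabilities(r12, r13, r23):
    """Reliabilities of three congeneric parts from their intercorrelations.

    Each reliability is the squared correlation of a part with the common
    true score, obtained by solving the three product equations
    r12 = a1*a2, r13 = a1*a3, r23 = a2*a3 for a1**2, a2**2, a3**2.
    """
    rel1 = r12 * r13 / r23
    rel2 = r12 * r23 / r13
    rel3 = r13 * r23 / r12
    return rel1, rel2, rel3


def unattenuated_correlation(r_xy, rel_x, rel_y):
    """Cross-test correlation corrected for the unreliability of both parts."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical correlations of the three parts with the true score.
a1, a2, a3 = 0.9, 0.8, 0.7
rel = split_thirds_reliabilities(a1 * a2, a1 * a3, a2 * a3)
print([round(v, 2) for v in rel])  # squared loadings: 0.81, 0.64, 0.49
```

Applying `unattenuated_correlation` to each of the nine cross-test correlations gives the nine estimates mentioned in the text; under the model they should agree up to sampling error.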

An important implication of the congeneric model is that the parts
of a test need not have the same units of measurement, which means
that items may be included from very different sources, e.g., a
“test” of achievement might include items derived from observation
of relevant learning situations. True score factors derived from the
commonalities among items of greater diversity may possibly be
of greater substantive interest (i.e., more valid) since the diversity
helps minimize the error of ascribing to trait factors common
variance really due to methods factors. To illustrate, consider the
case of two tests each with three parts, in which (a) $X_1$ and $X_2$
were paper and pencil components with similar item format, (b) $X_3$
was derived from unobtrusive observation of behavior, (c) $Y_1$ and
$Y_2$ were also paper and pencil components with similar item format,
and (d) $Y_3$ was derived from unobtrusive observation by a different
observer and under different circumstances, such that its errors of
measurement are likely independent of the errors of measurement in
$X_3$. Observations (a) and (c) suggest that the errors of measurement
among the four paper and pencil components are not independent
because of similarities of item format and/or general response
tendencies associated with paper and pencil tests. The fact that
$X_3$ and $Y_3$ are obtained under quite different circumstances lends
credence to the initial assumption that the errors of measurement
in these measures are approximately independent of each other
and of the errors of measurement on the paper and pencil
components. Especially with correlated error terms, it is helpful to
draw a visual diagram, e.g., in Figure 3 the correlated residuals
corresponding to observations (a) and (c) are depicted by
double-headed arrows, the lack of which indicates independent errors.

Figure 3. Model with specified nonindependent errors.

The diagram indicates which equations no longer hold, e.g.,
(a) equation (10) does not hold now because some of the correlation
between $X_1$ and $X_2$ arises from the common method instead of
solely from the common factor $X_T$, (b) for the same reason
$R_{Y_1Y_2}$ cannot be assumed equal to $R_{Y_1Y_T} R_{Y_2Y_T}$, (c) the
correlations $R_{X_1Y_1}$, $R_{X_1Y_2}$, $R_{X_2Y_1}$, $R_{X_2Y_2}$ no longer
are a simple function of the unattenuated correlation $R_{X_TY_T}$ and
the respective reliabilities. A useful rule in dealing with models
involving correlated errors is to first analyze correlations
involving independent residuals, i.e., correlations among variables
in Figure 3 which do not have two-headed arrows between the
corresponding error terms. Furthermore the diagram immediately
indicates the components of each correlation, e.g.:


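$$R_{X_1X_3} = R_{X_1X_T} R_{X_3X_T}, \tag{16}$$
$$R_{X_2X_3} = R_{X_2X_T} R_{X_3X_T}, \tag{17}$$
$$R_{Y_1Y_3} = R_{Y_1Y_T} R_{Y_3Y_T}, \tag{18}$$
$$R_{Y_2Y_3} = R_{Y_2Y_T} R_{Y_3Y_T}, \tag{19}$$
$$R_{X_1Y_3} = R_{X_1X_T} R_{X_TY_T} R_{Y_3Y_T}, \tag{20}$$
$$R_{X_2Y_3} = R_{X_2X_T} R_{X_TY_T} R_{Y_3Y_T}, \tag{21}$$
$$R_{X_3Y_1} = R_{X_3X_T} R_{X_TY_T} R_{Y_1Y_T}, \tag{22}$$
$$R_{X_3Y_2} = R_{X_3X_T} R_{X_TY_T} R_{Y_2Y_T}, \tag{23}$$
and
$$R_{X_3Y_3} = R_{X_3X_T} R_{X_TY_T} R_{Y_3Y_T}. \tag{24}$$

Solution of equations (16) through (24) yields three different
estimates of the unattenuated correlation $R_{X_TY_T}$:

$$R_{X_TY_T}^2 = \frac{R_{X_1Y_3} R_{X_3Y_1}}{R_{X_1X_3} R_{Y_1Y_3}}, \tag{25}$$

$$R_{X_TY_T}^2 = \frac{R_{X_1Y_3} R_{X_3Y_2}}{R_{X_1X_3} R_{Y_2Y_3}}, \tag{26}$$

and

$$R_{X_TY_T}^2 = \frac{R_{X_2Y_3} R_{X_3Y_1}}{R_{X_2X_3} R_{Y_1Y_3}}. \tag{27}$$

A fourth estimate, $R_{X_TY_T}^{*2} = (R_{X_2Y_3} R_{X_3Y_2}) \div (R_{X_2X_3} R_{Y_2Y_3})$, can be
derived from the other three. The two zero tetrad differences implied
by this model can be derived directly from these equations:
(a) from (25) and (26)

$$R_{X_3Y_1} R_{Y_2Y_3} = R_{X_3Y_2} R_{Y_1Y_3},$$

(b) from (25) and (27)

$$R_{X_1Y_3} R_{X_2X_3} = R_{X_2Y_3} R_{X_1X_3}.$$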


Since our interest is in the unattenuated correlation $R_{X_TY_T}$, the
solutions for the reliabilities will not be shown (e.g.,
$R_{Y_3Y_T}^2 = R_{X_1Y_3} R_{X_3Y_3} \div (R_{X_1X_3} R_{X_TY_T}^2)$); however, it is
interesting to note that after the reliabilities are computed the
contributions of methods factors are easily obtained, e.g., the
contribution between $X_1$ and $X_2$ is
$(R_{X_1X_2} - R_{X_1X_T} R_{X_2X_T})$. Whereas the model in Figure 2 yielded
nine estimates of the unattenuated correlation, the model
in Figure 3 yields only three, the difference arising from the six
correlations between residuals. If $X_3$ and $Y_3$ had nonindependent
errors because the same observer was used, then equation (24) would
be untrue and only two different solutions would be found. Finally,
if $X_2$ and $Y_2$ were not obtained and if $X_3$ and $Y_3$ were
nonindependent, the only valid equations would be (16), (18), (20),
and (22), which would yield the unattenuated correlation by equation
(25) even though none of the reliabilities could be computed. The
standard test theory assumption that the underlying true scores are
uncorrelated with errors of measurement is represented by the lack of
arrows from the error terms to the underlying true score. This
assumption would not be true if the method response tendency were
itself nonindependent of the true score factor, a problem which
Jöreskog (1970) considers in relation to the analysis of
multitrait-multimethod matrices.
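
The way several estimates of the squared unattenuated correlation arise from the correlations with independent residuals can be sketched as follows. This is an illustration under stated assumptions: the loadings and the value 0.6 for the true-score correlation are hypothetical, and the pairing of each estimate with a particular equation number is our reading of the model rather than something the code can confirm.

```python
def figure3_correlations(a, b, rho):
    """Model-implied correlations that involve X3 or Y3 (independent residuals).

    a, b: hypothetical correlations of the X parts and Y parts with their
    own true scores; rho: hypothetical correlation between the true scores.
    """
    a1, a2, a3 = a
    b1, b2, b3 = b
    return {
        "X1X3": a1 * a3, "X2X3": a2 * a3,
        "Y1Y3": b1 * b3, "Y2Y3": b2 * b3,
        "X1Y3": a1 * rho * b3, "X2Y3": a2 * rho * b3,
        "X3Y1": a3 * rho * b1, "X3Y2": a3 * rho * b2,
        "X3Y3": a3 * rho * b3,
    }


def rho_squared_estimates(r):
    """Three ratio estimates of the squared true-score correlation."""
    first = r["X1Y3"] * r["X3Y1"] / (r["X1X3"] * r["Y1Y3"])
    second = r["X1Y3"] * r["X3Y2"] / (r["X1X3"] * r["Y2Y3"])
    third = r["X2Y3"] * r["X3Y1"] / (r["X2X3"] * r["Y1Y3"])
    return first, second, third


r = figure3_correlations((0.9, 0.8, 0.7), (0.85, 0.75, 0.65), rho=0.6)
estimates = rho_squared_estimates(r)
print([round(e, 4) for e in estimates])  # each equals rho**2 = 0.36
```

With error-free model correlations the three ratios coincide; with sample correlations their disagreement reflects the two tetrad differences discussed above.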

Although the model in Figure 3 is a factor analytic model as are
those in Figures 1 and 2, it is a special case in which there are no
unique or residual scores. More specifically, in factor analysis the
basic model is $Y = \Lambda X + Z$ where $Y$ is a vector of observed test
scores, $X$ is a vector of latent common factor scores, and $Z$ is a
vector of unique scores. For the specific model in Figure 3 and in the
general case of correlated errors, the error factors should for
estimation purposes be considered as common factors. In our example,
the vector of factors is
$X = (X_T, Y_T, e_{X_1}, e_{X_2}, e_{X_3}, e_{Y_1}, e_{Y_2}, e_{Y_3})$, and the
vector of observed scores is $Y = (X_1, X_2, X_3, Y_1, Y_2, Y_3)$. To use
Jöreskog’s (1969) confirmatory factor analysis the researcher must
specify $\Lambda$, the matrix of factor loadings, and $\Phi$, the matrix of
factor variances and covariances. In our example:

$$\Lambda = \begin{bmatrix}
\beta_1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
\beta_2 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
\beta_3 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & \beta_4 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & \beta_5 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & \beta_6 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix}$$

$$\Phi = \begin{bmatrix}
1 & & & & & & & \\
* & 1 & & & & & & \\
0 & 0 & * & & & & & \\
0 & 0 & * & * & & & & \\
0 & 0 & 0 & 0 & * & & & \\
0 & 0 & * & * & 0 & * & & \\
0 & 0 & * & * & 0 & * & * & \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & *
\end{bmatrix} \quad \text{(symmetric)}$$

where * are “free” parameters to be estimated by the program
whereas all other parameters are “fixed.”

The diagonal of $\Phi$ corresponds to the variances of the corresponding
factors in $X$; therefore the ones mean that $X_T$ and $Y_T$ have (for
convenience) been standardized and the six error variances are to
be estimated. The off-diagonal (covariance) elements of $\Phi$ have
“free” parameters corresponding to the six correlations among error
factors in Fig. 3. In addition to estimating the “free” parameters
in $\Phi$, the program will estimate the six regression weights in
$\Lambda$. The algebraic explorations in the previous paragraph mean that
the solutions provided by the program will be unique, i.e., all linear
transformations of the factors that leave the “fixed” parameters
unchanged also leave the “free” parameters unchanged.
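
To make the specification concrete, the sketch below fills the loading and factor covariance matrices with hypothetical values and forms the model-implied covariance matrix of the six observed parts as the product of the loading matrix, the factor covariance matrix, and the transposed loading matrix. All numerical values (loadings, true-score correlation 0.6, error covariance 0.05) are assumptions made for illustration; no estimation program is being reproduced here.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


beta = [0.9, 0.8, 0.7, 0.85, 0.75, 0.65]  # hypothetical loadings
rho = 0.6                                  # hypothetical true-score correlation
err_cov = 0.05                             # hypothetical error covariance

# Loading matrix (6 observed parts by 8 factors): each part loads on its
# own true score (column 0 or 1) and, with unit weight, on its own error factor.
lam = []
for i in range(6):
    row = [0.0] * 8
    row[0 if i < 3 else 1] = beta[i]
    row[2 + i] = 1.0
    lam.append(row)

# Factor covariance matrix: standardized true scores, error variances chosen
# so that each observed part has unit variance.
phi = [[0.0] * 8 for _ in range(8)]
phi[0][0] = phi[1][1] = 1.0
phi[0][1] = phi[1][0] = rho
for i in range(6):
    phi[2 + i][2 + i] = 1.0 - beta[i] ** 2
paper_and_pencil = [2, 3, 5, 6]  # error factors of X1, X2, Y1, Y2
for p in paper_and_pencil:
    for q in paper_and_pencil:
        if p != q:
            phi[p][q] = err_cov

sigma = matmul(matmul(lam, phi), transpose(lam))
print(round(sigma[0][2], 4))  # X1 with X3: product of loadings only
print(round(sigma[0][1], 4))  # X1 with X2: product of loadings plus error covariance
```

The two printed entries show the distinction drawn in the text: pairs with independent residuals reproduce the simple product of loadings, while pairs sharing a method pick up the additional error covariance.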


Computing Whole Test Reliabilities

Given that congeneric assumptions are consistent with the data,
it is then possible to use the coefficients of the model to compute the
reliability of the whole test. Translating Lord and Novick’s formula
(1968, pg. 85), the whole test reliability is:

$$R_{XX} = \frac{\left(\sum_i R_{X_iX_T}\,\sigma_{X_i}\right)^2}{\sigma_X^2}, \tag{28}$$

where
$R_{X_iX_T}^2$ is the reliability of test part $i$,
$\sigma_X^2$ = variance of the observed total test scores,
$\sigma_{X_i}^2$ = observed variance of test part $i$.


This formula requires weaker assumptions than Cronbach’s α
coefficient because it does not require that the test parts be “e


are equal, Kristof (1969) provides a test of this requirement. When
errors of measurement are correlated as in Figure 3, then equation
(28) becomes a measure of how well the total score measures the
underlying true factor; however, this factor is no longer defined by
the commonality of the three parts as in the congeneric model.

While formula (28) provides a means of computing the whole
test reliability, it can be seen that the obtained reliability may
differ depending on how many parts the test is split into, since the
true factor underlying split halves may be different from the factor
defined by item communalities. In the absence of substantive theory
specifying which factor is relevant, no specification of the relevant
reliability is possible.
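
Formula (28) can be sketched in a few lines. The example assumes three standardized parts with hypothetical true-score correlations of 0.9, 0.8, and 0.7 and independent errors, so that the total-score variance can be written down directly from the congeneric model; the function name is our own.

```python
def whole_test_reliability(true_score_corrs, part_sds, total_variance):
    """Squared sum of (part-true score correlation times part SD),
    divided by the variance of the observed total score."""
    weighted_sum = sum(r * sd for r, sd in zip(true_score_corrs, part_sds))
    return weighted_sum ** 2 / total_variance


loadings = [0.9, 0.8, 0.7]   # hypothetical correlations with the true score
sds = [1.0, 1.0, 1.0]        # parts assumed standardized

# Under the congeneric model with independent errors, the covariance of two
# standardized parts is the product of their loadings.
covariances = [loadings[i] * loadings[j]
               for i in range(3) for j in range(i + 1, 3)]
total_variance = sum(sd ** 2 for sd in sds) + 2 * sum(covariances)

reliability = whole_test_reliability(loadings, sds, total_variance)
print(round(reliability, 3))
```

Because the weights are the part loadings rather than a common value, the parts are not required to have equal true-score variances.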


REFERENCES 


Holzinger, K. J. Statistical resume of the Spearman two-factor
theory. Chicago: University of Chicago Press, 1930.

Jöreskog, K. G. A general method for analysis of covariance
structures. Biometrika, 1970, 57, 239-251.

Jöreskog, K. G. Statistical models for congeneric test scores.
Proceedings, 76th Annual Convention, APA, 1968, pgs. 213-214.

Jöreskog, K. G. A general approach to confirmatory maximum
likelihood factor analysis. Psychometrika, 1969, 34, 183-202.

Kristof, W. Estimation of true score and error variance for tests
under various equivalence assumptions. Psychometrika, 1969, 34,
489-508.

Lord, F. M. and Novick, M. R. Statistical theories of mental test
scores. New York: Addison-Wesley, 1968.

Novick, M. R. and Lewis, C. Coefficient alpha and the reliability
of composite measurements. Psychometrika, 1967, 32, 1-13.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1972, 32, 129-136. 


ESTIMATES OF COEFFICIENT ALPHA FOR 
FINITE POPULATIONS OF ITEMS 


KEN SIROTNIK 
University of California, Los Angeles 


The notion of well defined, finite populations of items has
received increasing interest over the past decade. For example, the
literature on item sampling (Lord and Novick, 1968) presents for- 
mulas for estimating examinee score variances for matrices sampled 
from known and finite examinee and/or item populations. Loevinger 
(1965) has criticized item sampling theory in general indicating 
that (a) infinite populations of items simply cannot be catalogued 
and therefore (b) statistically random item samples are not possible. 
The key word here is infinite. Given that one can accept a cata- 
logued (written down and numbered) set or population of M items 
as defining the particular construct in question, one can make in- 
ferences about this population from a randomly drawn item sample 
of size m. It only remains, then, for one to select a measurement
theory, e.g. classical test theory, Guttman theory, Rasch theory, 
one’s own theory. If the sampling fraction m/M is large enough, 
and if it plays a statistical role in the theory, then the use of finite 
or infinite formulas can lead to quite different results. This paper
attempts to investigate implications for finite and known item 
populations for one kind of measurement theory and one kind of 
statistic, viz., classical test theory and the alpha coefficient among 
items in paper-and-pencil testing. 

In the Cronbach, Rajaratnam, and Gleser (1963) article presenting
a general framework for reliability theory, the KR-20 internal
consistency formula, or more generally coefficient α (Cronbach,
1950), was shown to be valid under more relaxed assumptions
than those used in the original derivation. Using the Hoyt (1941)
analysis of variance formulation and the Cornfield and Tukey
(1956) method of deriving expected values of mean squares, α
was shown to be valid assuming only random and independent
sampling of examinees, items, and examinee-item responses (the
response of a given examinee to a given item) from their
corresponding populations. Cronbach, et al. (1963) assumed the
examinee and item populations to be infinite. Although not explicitly
stated, they also assumed the examinee-item populations (the
population of responses of a given examinee to a given item) to be
infinite, extracting one random and independent sample from each.
In this paper, finite sampling formulas for α will be derived and
conceptual problems relating to the treatment of the examinee-item
populations will be discussed.


Formulation 

Consider a random model, n × m factorial design with r replications
in each cell, where the n levels correspond to a random sample
of n examinees from a population of size N, the m levels correspond
to a random sample of m items from a population of size M, and
the r replications correspond to r examinee-item responses for each
examinee and item sampled randomly and independently from
populations of size R. From only these sampling assumptions (all
sampling being done independently) the expected mean squares
(E[MS]) shown in Table 1 can be derived based on the following
linear and nonadditive model (replication index is ignored):

$$X_{ij} = \mu + \lambda_i + \tau_j + \lambda\tau_{ij} + \epsilon_{ij} \tag{1}$$

where
$X_{ij}$ = examinee-item response of examinee i to item j
$\mu$ = general level effect equivalent to the population mean
$\tau_j$ = item j effect in the population of items
$\lambda_i$ = examinee i effect in the population of examinees
$\lambda\tau_{ij}$ = interaction effect between examinee i and item j in the
population
$\epsilon_{ij}$ = residual or error effect remaining in the population.

TABLE 1
General Form of the E[MS] for the Two-Factor Design

Source    E[MS]
E         $(1 - r/R)\sigma_\epsilon^2 + r(1 - m/M)\sigma_{\lambda\tau}^2 + rm\sigma_\lambda^2$
I         $(1 - r/R)\sigma_\epsilon^2 + r(1 - n/N)\sigma_{\lambda\tau}^2 + rn\sigma_\tau^2$
EI        $(1 - r/R)\sigma_\epsilon^2 + r\sigma_{\lambda\tau}^2$
Error     $(1 - r/R)\sigma_\epsilon^2$

As is usually the case with data collected on paper-and-pencil
tests, r = 1 and the E[MS] of Table 1 take a slightly different
form (see Table 2). Finally, Table 3 presents the E[MS] for
various combinations (cases A through F) of values for R and M.
Cronbach, et al. (1963) deal exclusively with the combination
where both R and M are infinite (case D). A measure of internal
consistency is defined as the ratio of expected variance of the
examinee total score effect ($m\sigma_\lambda^2$) to the total expected
examinee variance ($E[MS_E]$) in the population. Using MS estimates,
the numerator is estimated by $MS_E - MS_{EI}$; the ratio, then, is
estimated by

$$\alpha = (MS_E - MS_{EI})/MS_E, \tag{2}$$

which is equivalent to the Hoyt (1941) formula. It should be
clear from Table 3 that the numerator (and thus the ratio) can
always be exactly estimated when M is infinite, regardless of how R
is conceptualized (cases D, E, and F). When M is finite (cases A,
B, and C), however, the numerator (and ratio) can be estimated
exactly only when R is finite and equal to one (case C). Otherwise,
exact estimates for the numerator cannot be computed; the ratio
can be conservatively estimated, however, as shown below. The
finite α formula will first be derived; implications regarding the
treatment of R will be discussed subsequently.
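
Derivations

Inspection of the E[MS] of Table 3 for case C shows that the
numerator of α can be estimated by $MS_E - (1 - m/M)MS_{EI}$;
for this case, then, α can be estimated by

$$\alpha = [MS_E - (1 - m/M)MS_{EI}]/MS_E. \tag{3}$$

In the infinite case D, the familiar computational formula for (2) is

$$\alpha = [m/(m - 1)]\left(1 - \sum_j s_j^2/s_x^2\right) \tag{4}$$

where
$s_j^2$ = variance of item j
$s_x^2$ = variance of examinee total scores.

TABLE 2
The E[MS] of Table 1 when r = 1

Source    E[MS]
E         $(1 - 1/R)\sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $(1 - 1/R)\sigma_\epsilon^2 + (1 - n/N)\sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $(1 - 1/R)\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2$

TABLE 3
E[MS] of Table 2 for Indicated Values of M and R

M (and N) finite

Source    E[MS]
A. R infinite
E         $\sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $\sigma_\epsilon^2 + (1 - n/N)\sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2$
B. R finite; R > 1
E         $(1 - 1/R)\sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $(1 - 1/R)\sigma_\epsilon^2 + (1 - n/N)\sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $(1 - 1/R)\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2$
C. R finite; R = 1
E         $(1 - m/M)\sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $(1 - n/N)\sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $\sigma_{\lambda\tau}^2$

M (and N) infinite

Source    E[MS]
D. R infinite
E         $\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2$
E. R finite; R > 1
E         $(1 - 1/R)\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $(1 - 1/R)\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $(1 - 1/R)\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2$
F. R finite; R = 1
E         $\sigma_{\lambda\tau}^2 + m\sigma_\lambda^2$
I         $\sigma_{\lambda\tau}^2 + n\sigma_\tau^2$
EI        $\sigma_{\lambda\tau}^2$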


To derive an analogous formula for the finite case, we proceed as
follows: Substituting the equalities (see Sirotnik, 1970, for proofs
of these equalities)

$$MS_E = n s_x^2/[m(n - 1)] \tag{5}$$

$$MS_{EI} = (nm\bar{s}_j^2 - n s_x^2/m)/[(n - 1)(m - 1)], \tag{6}$$

where $\bar{s}_j^2$ is the average item variance, into (3) we have

$$\begin{aligned}
\alpha &= \frac{\dfrac{n s_x^2}{m(n - 1)} - \dfrac{(1 - m/M)nm\bar{s}_j^2 - (1 - m/M)n s_x^2/m}{(n - 1)(m - 1)}}{\dfrac{n s_x^2}{m(n - 1)}} \\
&= \frac{(m - 1)s_x^2 - (1 - m/M)m^2\bar{s}_j^2 + (1 - m/M)s_x^2}{(m - 1)s_x^2} \\
&= \frac{m(1 - 1/M)s_x^2 - m^2(1 - m/M)\bar{s}_j^2}{(m - 1)s_x^2} \\
&= [m/(m - 1)]\left[(1 - 1/M) - (1 - m/M)\,m\bar{s}_j^2/s_x^2\right] \\
&= [m/(m - 1)]\left[(1 - 1/M) - (1 - m/M)\sum_j s_j^2/s_x^2\right].
\end{aligned} \tag{7}$$

Comparing (4) and (7), letting $\alpha_I$ and $\alpha_F$ be alphas for the
infinite and finite cases respectively, we see that $\alpha_F > \alpha_I$
since $(1 - 1/M) \approx 1$ and $(1 - m/M) < 1$. When M is small relative
to m, and the usual formula for $\alpha_I$ is used, the actual α will be
underestimated. It is difficult to tell how often this has occurred
in the literature since it is not always clear (a) how the test is
constructed and (b) how measures of internal consistency are to be
used.

For case A, where no exact formula is available, we can use either
formula (2) or (3) to approximate the estimate of α. In the former
case, α is underestimated since
$\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2 > \sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2$.
In the latter case, α is overestimated since
$(1 - m/M)(\sigma_\epsilon^2 + \sigma_{\lambda\tau}^2) < \sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2$.
If $\alpha_F'$ is the exact estimate of α in this case, then
$\alpha_F > \alpha_F' > \alpha_I$ in terms of the notation above. Similar
results occur in the remaining case B.
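
The agreement between the mean-square form (3) and the variance form (7) of the finite estimate can be checked numerically. The sketch below is an illustration only: the response matrix is simulated with a fixed seed, the population size M = 12 is hypothetical, and variances are computed with divisor n, which is the convention under which equalities (5) and (6) hold.

```python
import random


def mean_squares(x):
    """Examinee and examinee-by-item mean squares for an n x m score matrix."""
    n, m = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * m)
    row_means = [sum(row) / m for row in x]
    col_means = [sum(row[j] for row in x) / n for j in range(m)]
    ss_e = m * sum((r - grand) ** 2 for r in row_means)
    ss_i = n * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ms_e = ss_e / (n - 1)
    ms_ei = (ss_total - ss_e - ss_i) / ((n - 1) * (m - 1))
    return ms_e, ms_ei


def alpha_from_mean_squares(x, M):
    m = len(x[0])
    ms_e, ms_ei = mean_squares(x)
    return (ms_e - (1 - m / M) * ms_ei) / ms_e


def alpha_from_variances(x, M):
    n, m = len(x), len(x[0])
    totals = [sum(row) for row in x]
    t_bar = sum(totals) / n
    s_x2 = sum((t - t_bar) ** 2 for t in totals) / n      # divisor n
    item_vars = []
    for j in range(m):
        col = [row[j] for row in x]
        c_bar = sum(col) / n
        item_vars.append(sum((v - c_bar) ** 2 for v in col) / n)
    return (m / (m - 1)) * ((1 - 1 / M) - (1 - m / M) * sum(item_vars) / s_x2)


rng = random.Random(0)
n, m, M = 20, 6, 12
# Simulated right/wrong responses; examinees differ in success probability.
x = [[1 if rng.random() < 0.2 + 0.6 * i / (n - 1) else 0 for _ in range(m)]
     for i in range(n)]

a_ms = alpha_from_mean_squares(x, M)
a_var = alpha_from_variances(x, M)
print(abs(a_ms - a_var) < 1e-9)
```

Passing `M = float("inf")` reduces both functions to the familiar infinite-population forms (2) and (4).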

Discussion

The various treatments of R and M have important implications
not only for alpha formulas but for error of measurement and how
it is conceptualized. The traditional classical test theory formulation
(see, e.g., Gulliksen, 1950) assumed (in relation to case D) that all
of the effects in (1) were uncorrelated and, in fact, that
$\lambda\tau_{ij} = 0$ (or some constant) for all i and j. Thus, the standard
error of measurement was defined as $\sigma_\epsilon$ and exactly estimated
by $\sqrt{MS_{EI}}$, since the component of variance
$\sigma_{\lambda\tau}^2 = 0$. If these restrictive assumptions
are abandoned (Cronbach, et al., 1963) and if measurement error
variance is still conceptualized as $\sigma_\epsilon^2$, then it is clear from
Tables 1 and 2 that $\sigma_\epsilon^2$ cannot be estimated unless every
examinee responds more than once to every item (i.e., r > 1) with
these responses being independent of one another. Such a data matrix
would be difficult to come by in paper-and-pencil testing. In the more
realistic case where r = 1, it is clear from Table 3 that there is no
measurement error variance in cases C and F; furthermore, it is only
case C that permits an exact estimate of α when M is finite. Our only
exact finite estimate of α, then, is not a function of measurement
error!

It would seem desirable (at least in paper-and-pencil testing)
to conceptually redefine measurement error (as was apparently
done by Cronbach, et al., 1963) not only in terms of variability
in individual performance on a given item, but in terms of
variability due to examinee-item interaction. The most reasonable and
least restrictive models would be those corresponding to cases A
and D in Table 3 for M finite and M infinite respectively. Error
of measurement could then be estimated ($MS_{EI}$) as a “residual”
variance by pooling the $\epsilon_{ij}$ and $\lambda\tau_{ij}$ effects. In case A,
however, this residual would most appropriately be
$\sigma_\epsilon^2 + (1 - m/M)\sigma_{\lambda\tau}^2$ which, like α, can only be
under- or over-estimated, by $(1 - m/M)MS_{EI}$ or $MS_{EI}$
respectively.

The preceding formulas and discussion are best illustrated by a
hypothetical situation in which it is desired to estimate α for a well
defined finite population of items. Table 4 presents the analysis of
variance for the matrix of responses of 90 examinees to 30 multiple
choice vocabulary items sampled from an available pool of 60 such
items. Suppose these 60 items represented a section of a standardized
achievement test for which a teacher desired reliable scores for each
of his students. Suppose further that the teacher has available only
half the testing time needed to complete the entire 60-item test.
Within his time restrictions, he can randomly sample 30 items and,
employing the formulas discussed here, estimate α for the 60-item
test. Applying formula (2) to the data in Table 4, we obtain α =
.61; applying formula (3), we obtain α = .80, a substantial increase
in value. For error of measurement variance (perhaps more
appropriately termed measurement residual variance) we have the
estimates .10 or .20 for the formulas $(1 - m/M)MS_{EI}$ or $MS_{EI}$
respectively. In view of the foregoing discussion, the teacher can
report that the exact estimate of α for the 60-item test is in the
interval .61 to .80 and that the exact estimate of measurement
residual variance lies in the interval .10 to .20.

TABLE 4
Analysis of Variance Table for 90 Examinees’ Responses to 30 Items

Source       df     Mean Squares
Mean          1      501.813
Examinees    89        0.512
Items        29        3.364
Residual   2581        0.201
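
The arithmetic of this illustration can be reproduced directly from the mean squares in Table 4. The sketch below uses only the tabled values and the sampling fraction 30/60; the function names are our own.

```python
def alpha_infinite(ms_e, ms_ei):
    """Formula (2): item population treated as infinite."""
    return (ms_e - ms_ei) / ms_e


def alpha_finite(ms_e, ms_ei, m, M):
    """Formula (3): m items sampled from a finite population of M items."""
    return (ms_e - (1 - m / M) * ms_ei) / ms_e


ms_examinees = 0.512   # Table 4, Examinees
ms_residual = 0.201    # Table 4, Residual (examinee-by-item)
m, M = 30, 60

lower_alpha = alpha_infinite(ms_examinees, ms_residual)
upper_alpha = alpha_finite(ms_examinees, ms_residual, m, M)
lower_resid = (1 - m / M) * ms_residual
upper_resid = ms_residual

print(round(lower_alpha, 2), round(upper_alpha, 2))   # 0.61 0.8
print(round(lower_resid, 2), round(upper_resid, 2))   # 0.1 0.2
```

The two printed pairs are the bounds reported in the text for α and for the measurement residual variance.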


Summary 


Alpha estimates (and error of measurement variances) have been
compared under various analysis of variance models represented by
cases A through F in Table 3. Assuming cases A and D to be the most
appropriate models for finite and infinite item populations
respectively, it was shown that

1. An exact estimate of α is possible only in the infinite case
(formula 2); the exact estimate of α in the finite case is bounded
below and above by the alphas computed by formulas 2 and 3
respectively.

2. If error of measurement variance is conceptualized only as
examinee-item response variability, it cannot be exactly estimated
in either the finite or infinite case. It can be overestimated by
$MS_{EI}$.

3. If error of measurement variance is conceptualized as a
residual variance obtained by pooling error and interaction
components, it can be exactly estimated only in the infinite case by
$MS_{EI}$. The exact estimate of error of measurement variance
conceptualized in this way in the finite case is bounded below and
above by $(1 - m/M)MS_{EI}$ and $MS_{EI}$ respectively.


CONSISTENT COLLEGE GRADING STANDARDS
THROUGH EQUATING¹


JOHN R. HILLS 
Florida State University 


In most modern educational measurement textbooks one finds a
section on grading practices or marking. Usually such treatises
discuss the problem of inconsistency of grading standards from
instructor to instructor, course to course, and institution to
institution. A small body of literature clearly pointing out such
problems and suggesting remedies is developing (Ebel, 1965; Fricke,
1965; Gold, 1966; Hills, 1965, 1967; Hills and Gladney, 1968;
Hills, Gladney, and Klock, 1967; Juola, 1968). Perhaps the best
succinct summary is the comment in the Fall, 1966, issue of
Teacher-Learning Issues (Standing Council on the Improvement of
Teaching and Learning, 1966) that, “At least two conclusions seem to
emerge from the evidence presented: College grading practices
across the nation are chaotic and in some instances inhumane.”
This paper presents the results of an attempt to use available
procedures to reduce some of the chaos in college grading practices.


Method 


Gulliksen (1950) presented a method, ascribed to Tucker, for
equating two forms of a test given to different groups. The equating
is accomplished through use of anchor items common to both groups.
It is essentially the procedure used to equate scores from various
forms of tests such as the College Entrance Examination Board’s
Scholastic Aptitude Test, and it seems to be a convenient way for
college teachers to bring consistency into their grading standards.
For instance, it is readily possible for faculty members teaching
different sections of the same course to have some common items
in their final examinations and to equate their total examination
scores, and even their grades, to a common scale. Or one faculty
member could easily include common items from one term to
another of the same course so that he could equate to a common
scale the numbers on which he bases his final grades.

¹ Many students and colleagues cooperated in this set of studies. Especial
gratitude is due to Dr. Earl Brown of Georgia State University who let me
teach and gather data in his psychology department, to Mrs. Marilyn
Gladney who did many hand computations before computer programs were
written, to Mr. Stephen Ivens who checked those hand computations and also
assisted in all aspects of the course which provided the third illustration,
and finally to Mr. J. Bert Keats who wrote the computer program for the
analyses, a report of which was published in this journal (Keats, 1970).

Furthermore, Gulliksen presents the formulas for equating on 
the basis of more than one variable. Thus, different departments 
which could not use common examination items because they in- 
volved different content, could still equate on common material 
through anchor variables such as entrance examination scores, high 
school average grades, or predicted average grades. In a sense, 
this would be equating based on academic potential instead of on 
achievement in specific course material, so it might be less ap- 
pealing to faculty. However, it might still be useful in situations in 
which it is not possible to equate directly on the basis of achieve- 
ment. 

There might be objections to the use of test scores from outside 
agencies or grades from other institutions as equating variables. 
This can be overcome through a device implicit in comments by 
Fricke (1965), and in the work of Cox (1969), i.e., the use of
average grades in other courses at the college of current enroll- 
ment as the anchor variable. Often these cumulative average grades 
are routinely provided to each student at the end of each term 
along with his grade report for that term. Thus, they may be 
a readily available anchor which should be reasonably accept- 
able to the faculty, since it was responsible for those cumulative 
averages. 

The equating procedures are relatively simple to use, but since
large numbers of students may be involved, computer programs
have been prepared for using Tucker’s equations and for
substituting the resulting values into a formula which yields a
rescaled score for each student (Keats, 1970).


Procedure


Consider giving Form A of a final examination to the Fall
Quarter class, and Form B to the Winter Quarter class. If one
has included in each form a set of items, call them Part X, so that
a score on those items can be obtained for each student in both
classes, one makes the assumption that the regression of final
examination score on X score is the same in both groups. He then
computes the regression coefficient in the group to which he wishes
to equate. In this case, let us equate Winter Quarter scores on
Form B to Fall Quarter scores on Form A. We would first obtain
the regression coefficient ($b_F$) of Fall Quarter (Form A)
examination scores on Part X scores.
Then the mean for the Winter Quarter class on Form B when
adjusted to the Fall Quarter scale for Form A is obtained from:

$$\bar{F}_W = \bar{F} + b_F(\bar{X}_W - \bar{X}_F), \tag{1}$$

where

$\bar{F}_W$ is the adjusted mean for the Winter Quarter group (taking
Form B) expressed in terms of the Fall Quarter (Form A)
scale,

$\bar{F}$ is the mean of the Fall Quarter group (on Form A),

$b_F$ is the regression coefficient for the Fall Quarter group of
Form A scores on Part X scores,

$\bar{X}_W$ is the mean on Part X for the Winter Quarter group, and

$\bar{X}_F$ is the mean on Part X for the Fall Quarter group.


The variance for the Winter Quarter class on Form B when
adjusted to the Fall Quarter scale for Form A is obtained from a
similar expression:

$$\sigma_{F_W}^2 = \sigma_F^2 + b_F^2(\sigma_{X_W}^2 - \sigma_{X_F}^2), \tag{2}$$

where

$\sigma_{F_W}^2$ is the adjusted variance for the Winter Quarter group
(taking Form B) expressed in terms of the Fall Quarter
(Form A) scale,

$\sigma_F^2$ is the variance of the Fall Quarter scores (on Form A),

$b_F^2$ is the squared regression coefficient of Form A scores on
Part X scores for the Fall Quarter group,

$\sigma_{X_W}^2$ is the variance of Part X scores in the Winter Quarter, and

$\sigma_{X_F}^2$ is the variance of Part X scores in the Fall Quarter.


To obtain rescaled scores for individuals, equation 3 is used,

$$F_{W_i} = \bar{F}_W + \frac{\sigma_{F_W}}{\sigma_W}(W_i - \bar{W}), \tag{3}$$

where

$F_{W_i}$ is the rescaled score for Winter Quarter student i,

$\sigma_W$ is the standard deviation of Winter Quarter scores (on
Form B),

$W_i$ is the score for Winter Quarter student i (on Form B), and

$\bar{W}$ is the mean score for Winter Quarter students (on Form B).
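
The three equations of this procedure can be sketched as a short routine. This is a minimal illustration under stated assumptions, not the Keats (1970) program: the function and variable names are our own, the scores are made-up values, and variances are computed with divisor n (the population form), a choice the text does not specify.

```python
import math
import statistics


def regression_slope(y, x):
    """Least-squares slope of y on x."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def equate_to_fall_scale(fall_total, fall_anchor, winter_total, winter_anchor):
    """Rescale Winter (Form B) scores to the Fall (Form A) scale."""
    b = regression_slope(fall_total, fall_anchor)
    # Equation 1: adjusted mean.
    adj_mean = statistics.mean(fall_total) + b * (
        statistics.mean(winter_anchor) - statistics.mean(fall_anchor))
    # Equation 2: adjusted variance.
    adj_var = statistics.pvariance(fall_total) + b ** 2 * (
        statistics.pvariance(winter_anchor) - statistics.pvariance(fall_anchor))
    if adj_var <= 0:
        raise ValueError("adjusted variance is not positive")
    # Equation 3: linear rescaling of each Winter score.
    scale = math.sqrt(adj_var) / statistics.pstdev(winter_total)
    w_bar = statistics.mean(winter_total)
    rescaled = [adj_mean + scale * (w - w_bar) for w in winter_total]
    return adj_mean, adj_var, rescaled


# Made-up scores for four students per quarter.
fall_total = [60, 70, 80, 90]
fall_anchor = [10, 12, 14, 16]
winter_total = [55, 65, 70, 90]
winter_anchor = [11, 13, 15, 19]

adj_mean, adj_var, rescaled = equate_to_fall_scale(
    fall_total, fall_anchor, winter_total, winter_anchor)
print(adj_mean, adj_var)
```

By construction the rescaled Winter scores have the adjusted mean and the adjusted variance, which is the sense in which they are expressed on the Fall Quarter scale.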


It is sometimes the case that a department with multiple sections 
of an introductory course will have a common final examination, 
and the distribution of scores on that final examination determines 
grades for individual students, with no further participation by the 
individual faculty members. This is unpalatable to both students 
and faculty. The former feel that different instructors emphasize 
different parts of a course, and the final examination cannot reflect 
all these idiosyncrasies equitably. The latter feel that there are 
elements outside the final examination that should influence the 
level of some students’ grades. 

The first of these problems is not solved by this procedure unless
the instructors carefully plan the common final examination and
their teaching to avoid inequities due to their idiosyncratic
emphases. However, if one used the final examination only as an
anchor variable to equate the mean and standard deviation of
grades from one section to another, the second problem is
surmounted.


Illustrations 
Illustrations of the use of this procedure in grading are
illuminating. In the first two illustrations below, four teachers
taught a total of seven sections of introductory psychology. They
used a common agreed-upon final examination. Each, however, graded
each of his students as he saw fit. These data and additional
information extracted from the Registrar’s records were used for the
illustrations presented here. (In the future, information from
similar procedures could easily permit scaled grades to be reported
to faculty before final grading decisions were made. Instructors
could then be guided by the equated results if they felt them to be
relevant.)


Equating via Final Examination 

Equating using the final examination as the anchor is displayed
in Table 1 under “Final Examination.” The section with complete
data available for the largest number of students was selected to
provide the scale to which each of the other sets of grades would be
equated. This is Section 4 in Table 1. The final examination means
and σ's for the sections are listed under “Exam Parameters.” The
means of the actual grades given to students, and their standard
deviations, are listed in the column headed “Obtained Grades,”
while the means of rescaled grades, and their standard deviations,
are listed in the column headed “Adjusted Grades.”

It can be seen in this area of Table 1, that there were some stern 
graders and some lenient graders, relative to the scale used in 
Section 4. 

Not only were the means frequently different, but also the marks 
given in Section 5 are peculiarly concentrated about the mean, with 
a standard deviation of only four-tenths of a letter grade. When 
rescaled to be comparable to the scale used in Section 4, the grades 
of these students are spread out more. This is the only case in 
this set of data of a marked change of standard deviation upon 
rescaling. 


Equating via Admissions Data 


The above illustrated rescaling or equating based on achieve- 
ment in the subject matter under study. Since these students had 
been admitted as beginning freshmen, there were records in the 
Registrar’s Office of their high-school average grades and their 
College Board SAT scores upon entrance. These data were ob- 
tained to illustrate what might happen if equating were to be 
based on academic-potential data, as might be necessary in equating
across different subject-matters. The data appear in Table 1
under “Admission Data.” 

The results from this method are not entirely consistent with 
those of the previous method. The basis for equating was not the 
same. Since Section 4 had quite low admissions variables (seen in 
Table 1) but was second only to Section 1 in mean grade given, 
the rescaling via admissions data tends to raise the marks in all
the other sections.


Equating via Common Examination Items 


The third set of data comes from a different institution and 
subject matter. In this case there are only two sections of one 
course, a graduate-level course in advanced measurement. Each
section had a one-hour midterm examination and two hours of 
final examination, both including multiple-choice and brief essay 
items. Fifteen items, worth 50 points, were in common to the two 
final examinations. The maximum number of points for grading 
purposes in the first (1968) group was 286; in 1969, it was 242. 
There were 10 students in 1968 and seven in 1969. (Sample sizes 
for such courses will never be large, but grades must still be given.) 

Table 2 presents results obtained using as the anchor variable 
achievement on the 50 points for the common items. The distribu- 
tion parameters for the anchor variable also appear there. It can be 
seen that the 1969 group performed much better on the common 
items, although their mean total score on final examination and 
midterm appears to be lower than that of the earlier class. (It must 
be remembered that the earlier class had a larger maximum pos- 
sible number of points, however.) Rescaled, the 1969 class appears 
to have performed much better on the average than the earlier 
class. 

The 1969 class also had a much narrower standard deviation 


TABLE 2
Equating Via Anchor Items in Final Examinations

Section    N          Obtained   Adjusted   Anchor
1968      10    M      205.8        —        28.0
                σ       27.5        —         8.0
1969       7    M      196.1      244.7      44.6
                σ       10.0       20.6       1.8

TABLE 3
Total Exam Scores and Grades, 1968

Student   Total Exam Score   Grade
   1            238            A
   2            236            A
   3            230            A
   4            224            B
   5            218            B
   6            216            B
   7            190            C
   8            174            C
   9            168            C
  10            164            D


than the earlier group on the control variable. This result might
be attributable to a different teaching approach used in 1969, one
based on clearly specified behavioral objectives made available to
each student at the beginning of the term, and the inclusion of


The actual distribution of grades given to the students in the
1968 term appears in Table 3. The midterm and final total sums
for the 1969 group appear in the second column of Table 4. In


several were grouped near 200, two more were near 190, and one
was much lower at 177. It would be easy to give grades of A, B, C,


TABLE 4
Exam Scores and Grading Possibilities, 1969

                       Tentative Grades     Rescaled    Equivalent
Student   Total Sum      I        II         Scores       Grades
   1         210         A        A           273           A
   2         204         B        A           261           A
   3         201         B        A           255           A
   4         199         B        A           251           A
   5         192         C        B           236           A
   6         190         C        B           232           A
   7         177         D        C                         C

and D to this pattern, as in Tentative Grades I, or, allowing
for the select nature of graduate students, to grade according to
Tentative Grades II. However, rescaling to the 1968 Total Score
scale indicated that all but one of the students in 1969 performed
as well as the group who received marks of A (i.e., above 230)
in 1968, as will be seen in the right-hand columns of Table 4.
Though the information is based on a small number of cases, it 
is more helpful than most of the things at which an instructor 
grasps as he tries to solve the problem of remaining consistent in 
his grading. 
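
The rescaled scores in Table 4 are consistent with a simple linear transformation that carries the obtained 1969 mean and standard deviation (196.1 and 10.0 in Table 2) onto the adjusted values (244.7 and 20.6). The Python sketch below assumes that linear form. The function name `rescale` is illustrative only, and the step that derives the adjusted parameters from the anchor items is not reproduced here.

```python
# Obtained and adjusted parameters for the 1969 class, taken from Table 2.
OBTAINED_MEAN, OBTAINED_SD = 196.1, 10.0
ADJUSTED_MEAN, ADJUSTED_SD = 244.7, 20.6


def rescale(total_sum):
    """Map a 1969 total sum onto the 1968 Total Score scale (assumed linear)."""
    z = (total_sum - OBTAINED_MEAN) / OBTAINED_SD   # standing within the 1969 class
    return ADJUSTED_MEAN + ADJUSTED_SD * z          # same standing on the 1968 scale


for total in (204, 201):
    print(total, round(rescale(total)))   # Table 4 lists 261 and 255 for these students
```

Because the transformation is linear, it preserves the rank order of students within the class and changes only the location and spread of their scores.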


Other Possibilities 


Under way currently is an attempt to rescale freshman grades 
throughout a university by use of cumulative average grades as 
an anchor variable. One possible outcome is greater predictability 
of freshman grades, sought in the past by such procedures as 
central prediction (Tucker, 1960), or adjustment of high-school 
grades (Lindquist, 1963). 


Summary 


In order to achieve greater consistency in grading practices, 
procedures are described for equating grading data from one class 
to another on the basis of common material, such as a common
final examination, common admissions or academic-potential 
variables, or anchor sections of common material in examinations. 
The procedure is essentially that used by large testing agencies. 
Here it is given a new function. Results from use of the three kinds 
of anchor material are illustrated by data from college classes at 
the undergraduate and graduate level in psychology and educational 
measurement. Significant changes in marks assigned would be sug- 
gested by the rescalings of the data studied here. 
A VALIDITY STUDY OF SCALES TO MEASURE NEED
ACHIEVEMENT, NEED AFFILIATION, IMPULSIVENESS, 
AND INTELLECTUALITY 


ROBERT H. FRIIS 
Columbia University 
ALAN B. KNOX 
Columbia University 


The Center for Adult Education at Teachers College, Columbia
University, under the direction of Alan B. Knox, has been conducting
a multivariate study of the factors that are associated with the
noncollege bound young adult's decision to enroll in adult education
programs. This group, ranging in age from approximately 16 to 25,
comprises a substantial proportion of high school graduates. In an
era of ever increasing technological and societal complexity, it will 
be necessary for these individuals to acquire post high school adult 
education. One of the purposes of the Young Adult Study was to 
assess the motivational factors related to the decision to obtain 
post high school adult education. 

Among the dimensions focused upon were the variables of need 
achievement, need affiliation, impulsiveness, and intellectuality, 
which are typically measured by means of either projective tech- 
niques or objective scales. Representative of the former approach is 
Murray’s (1938) Thematic Apperception Test, whereas examples of 
objective personality measures are the Edwards (1959) Personal 
Preference Schedule which contains need achievement and need 
affiliation scales, or the Bendig (1964) need achievement scales. In 
large survey studies, it is often not feasible to use projective per- 
sonality measures because of the expense of administration and 
scoring. Objective measures of these variables, which would lend 
themselves to rapid computerized scoring and analysis, were chosen 
for the Young Adult Study. The present article reports the results
of a validity study of the measures of need achievement, need
affiliation, impulsiveness and intellectuality that were employed in
the study.


Sample


As part of a larger study of factors associated with participation 
in formal and informal adult education by high school graduates 
who do not attend college, a representative sample of 500 young 
men and women who did not go on to college in several community 
settings was chosen. One of the communities was an industrial city 
of approximately 200,000 people in the northeastern United States,
another a midwestern city of approximately 150,000 people, and the
third a rural area encompassing approximately 150,000 people in
the midwest. The samples consisted primarily of 1960 and 1965
male high school graduates from a number of randomly sampled
high schools in the three community areas. Also included in the
northeastern urban sample were 50 female 1960 graduates. Each of
the participants was interviewed and tested for an average of about
two hours on a large number of variables.


Measuring Instruments 


The central personality variables were need achievement, need 
affiliation, impulsiveness, and intellectual interest. Need achieve- 
ment was defined as follows: having goals, striving to accomplish 
tasks as quickly as possible, attempting to exert one's best efforts. 
Need affiliation was defined as the desire to be with other people 
even if they are strangers; the desire to share common opinions with 
others. Impulsiveness was defined as acting on the spur of the 
moment; having rapid shifts in interest; having a sense of time 
urgency and of tension. The intellectual interest variable encompassed
such activities as interest in reading books and interest in
taking part in serious discussions.

Employed as concurrent validity criteria of the foregoing varia- 
bles were father’s occupational level, father's educational level, 
mother’s educational level, respondent’s expectation for advancement,
verbal intelligence test score, number of organizations joined,
amount of time spent in organizations, a measure of positive orientation
toward planning, thoughtfulness, fatalism, and adherence to
authority.

Scales to measure need achievement and need affiliation were 
developed by Richard Videbeck, who is associated with the Center
for Adult Education; each of the scales contained seven items. The
impulsiveness scale, which had five items, and the intellectual interest
scale, with four items, were composed by Borgatta (1965). Table
1 presents the items contained in each scale, the keyed response for 
each item, and the percentage of responses in the keyed direction. 
The item content for each scale seemed to conform to the definitions 
of the variables presented earlier. Table 1 also indicates that each 
of the scales consisted primarily of items to which between 20 and 
80 per cent of the young adults responded in the positive direction of 
a dichotomous item format. Mean scores, variances, and Kuder- 
Richardson 20 reliabilities for the four scales are presented in Table 
2. The reliability data demonstrated that each of the measures was
moderately reliable. Previous Guttman scale analysis with more
heterogeneous samples of adults yielded coefficients of reproducibility
greater than .80.
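
The means and Kuder-Richardson 20 reliabilities in Table 2 can be checked against the item percentages in Table 1. The Python sketch below applies the standard KR-20 formula, k/(k - 1) times (1 - Σpq/σ²), to the proportions of keyed responses and the scale variances as printed. The function name `kr20` and the dictionary layout are illustrative and are not part of the original study.

```python
def kr20(p_keyed, total_variance):
    """Kuder-Richardson formula 20 from item proportions and total-score variance."""
    k = len(p_keyed)
    sum_pq = sum(p * (1.0 - p) for p in p_keyed)   # sum of item variances p*q
    return (k / (k - 1)) * (1.0 - sum_pq / total_variance)


# Proportions responding in the keyed direction (Table 1) and scale variances (Table 2).
scales = {
    "Need Achievement": ([.47, .67, .83, .89, .83, .49, .49], 2.47),
    "Need Affiliation": ([.75, .57, .73, .53, .58, .57, .29], 2.49),
    "Impulsiveness": ([.49, .34, .20, .28, .39], 2.06),
    "Intellectual Interest": ([.70, .59, .69, .80], 1.43),
}

for name, (p, variance) in scales.items():
    # The scale mean is the sum of the item proportions.
    print(f"{name}: mean {sum(p):.2f}, KR-20 {kr20(p, variance):.2f}")
```

Run on these inputs, the sketch returns the means and reliabilities reported in Table 2 to two decimal places, which suggests the two tables are internally consistent.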

The remaining variables that were selected to provide criteria of 
concurrent validity had been used extensively in previous research. 
Verbal intelligence was measured via the Quick Word Test, a brief
self-administered test that provides an estimate of mental ability 
(Borgatta and Corsini, 1964). The planfulness, thoughtfulness, and 
fatalism scales were taken from Strodtbeck (1958). See also Hum- 
mel and Sprinthall (1965). The adherence to authority measure 
was from Borgatta (1967). 


Results 


Table 3 presents the intercorrelations between each of the four 
scales and the selected validity criteria. The variables that were 
most highly positively associated with need achievement were impulsiveness
and planfulness. For the noncollege bound, impulsiveness
may reflect the tension aspect of need achievement. In this
respect, Borgatta (1965) found that the impulsiveness scale was 
significantly negatively correlated with a lack of tension scale for 
both high school and college students in his study. The planfulness 
scale was factor II (Hummel and Sprinthall, 1965) of the Strodt- 
TABLE 1

Need Achievement, Need Affiliation, Impulsiveness, and Intellectual Interest Scale
Items, Keyed Responses, and Percentage of Responses in the Keyed Direction

                                                                        Percentage of
                                                                        Responses in
Scale and Item                                       Keyed Response    Keyed Direction

Need Achievement (a)
  I know exactly what I want out of life.               agree               47
  In general, I try to make every minute count.         agree               67
  Everyday, I try to accomplish something
    worthwhile.                                         agree               83
  I almost always feel that I must do the best at
    what I am doing.                                    agree               89
  I always do my best whether I am alone
    or with someone.                                    agree               83
  I very often find myself doing or saying
    something for the pleasure of it, rather
    than because it serves some purpose.                disagree            49
  I try harder to be content with myself than
    to be successful.                                   disagree            49

Need Affiliation (a)
  It doesn't usually bother me to meet strangers.       agree               75
  Most of the time I see things differently than
    others do.                                          disagree            57
  I consider myself a good mixer.                       agree               73
  In many ways my ideas of right and wrong
    differ from those of people with whom I
    associate.                                          disagree            53
  If at all possible, I avoid being alone.              agree               58
  It never bothers me to go into a room by
    myself where other people have already
    gathered and are talking.                           agree               57
  Often I attend social gatherings just to be
    with others.                                        agree               29

Impulsiveness (b)
  I usually act on the spur of the moment.              agree               49
  My interest shifts quickly from one thing to
    another.                                            agree               34
  I enjoy planning work carefully before
    carrying it out.                                    disagree            20
  I rarely think things out in detail before I act.     agree               28
  I am impulsive about most things.                     agree               39

Intellectual Interest (b)
  I enjoy intellectual pursuits.                        agree               70
  I enjoy reading books.                                agree               59
  I find it difficult to concentrate on reading
    anything longer than a newspaper article.           disagree            69
  I don’t like discussions about serious
    problems or world affairs.                          disagree            80

(a) These Guttman scales developed by Videbeck achieved a coefficient of reproducibility
(b) Two of six scales from the S-Ident Form developed by Borgatta (1965). Protected by copyright.

TABLE 2

Mean Scores, Variances, and Kuder-Richardson 20 Reliabilities of Need
Achievement, Need Affiliation, Impulsiveness, and Intellectual Interest Scales

Scale                    Mean Score   Variance   KR 20 Reliability
Need Achievement            4.67        2.47           .53
Need Affiliation            4.02        2.49           .43
Impulsiveness               1.70        2.06           .60
Intellectual Interest       2.78        1.43           .56


tation to planning, education, and mastery over the environment,” a
likely correlate of need achievement. The other significant corre- 
lates of need achievement were need affiliation, intellectual interest, 
expectation for job advancement, organizational participation, 
thoughtfulness, and an unexpected negative correlation with father’s 
educational level. With the latter exception, these correlates reflect 
important dimensions of the construct, need achievement. The nega- 
tive correlation of need achievement with father’s educational level 
is not easily explained, since the educational level of children is 
generally related to that of parents, and increasingly higher degrees 
of achievement motivation should be necessary to obtain greater 
amounts of formal education. Father's educational level was pre- 
dictably positively correlated with mother’s educational level, fa- 
ther’s occupational level, and the young adult’s verbal ability. 
There was also a negative correlation between father’s occupational 
level and thoughtfulness which was consistent with the relationship 
between need achievement and thoughtfulness. Perhaps for those 
high school graduates who did not go on to college, but whose fa- 
thers had a high level of education, the young adult’s very low level 
of need achievement was one reason for not going on to college. 
Need affiliation was significantly positively correlated with need 
achievement, intellectual interest, organizational participation, 
planfulness, and negatively correlated with impulsiveness. Our find- 
ing of a negative correlation between need affiliation and impulsive- 
ness is consistent with Borgatta’s (1965) earlier findings that aloof- 
ness, which can be conceptualized as low need affiliation, was 
positively correlated with impulsiveness. These correlates of the 
need affiliation scale support the concurrent validity of the scale. The 
general literature on need affiliation is consistent with these find- 
ings and does not indicate that any of the remaining validity cri- 
teria would have been significantly correlated with need affiliation.

The highest positive correlates of impulsiveness were need achievement
and fatalism, and the highest negative correlates were planfulness
and intellectual interest. The negative correlation between
impulsiveness and intellectual interest is in line with the findings 
of Borgatta (1965) for high school and college students. In the 
present study, impulsiveness was also positively correlated with ad- 
herence to authority and negatively correlated with need affiliation. 
The pattern of these and the remaining correlations seemed to be 
readily interpreted and together present a picture of the impulsive 
noncollege bound young adult as a nonintellectual striving within 
the context of established values of achievement and affiliation. 

The fourth scale, intellectual interest, correlated most highly in 
the positive direction with thoughtfulness, and also with planful- 
ness, verbal ability, and father’s occupational level. Intellectual 
interest correlated negatively with fatalism. Thoughtfulness and 
planfulness fell in the same cluster of concepts along with intel- 
lectual interest; verbal ability and father’s level of occupational 
prestige are likely to reflect the factors of intelligence and social 
class that are associated with intellectual interest and educational 
participation throughout the literature. Finally, the negative cor- 
relation of intellectual interest with fatalism may reflect the ques- 
tioning aspect of intellectual interest, wherein the individual may 
perceive that he is capable of improving his existence through the 
application of intellect. 


Conclusion 


In general, the four scales were related logically to the concur- 
rent validity criteria of the present study. However, none of the 
significant relationships between the scales and validity criteria was 
above .30; the amount of variance accounted for by the correlations 
ranged from about one to nine per cent. Nevertheless, the direction 
of correlations was such that these four scales appear to have
sufficient validity and reliability when used with noncollege bound
young adults to warrant their further refinement and use. In addition to
assessing predictive validity, future research might concentrate on
improving reliability and ascertaining the factorial structure of the
scales.
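
The "one to nine per cent" figure follows from squaring the correlations, since the proportion of variance two measures share is r². A brief Python illustration of that arithmetic is given below. The function name is hypothetical, and the three correlations are round values chosen to span the range up to .30, not entries from Table 3.

```python
def variance_accounted_for(r):
    """Percentage of variance shared by two variables correlated r."""
    return 100.0 * r * r


for r in (0.10, 0.20, 0.30):
    print(f"r = {r:.2f}: {variance_accounted_for(r):.0f} per cent")
```

A correlation of .30, the ceiling reported here, thus leaves roughly nine-tenths of the variance in the criterion unexplained by the scale.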
A COMPUTER PROGRAM FOR NONPARAMETRIC
TESTS OF ORDERED HYPOTHESES 


JAMES J. ROBERGE 
Temple University 


In many educational and psychological experiments, the re- 
searcher can, on the basis of theory or prior investigation, specify 
an a priori ordering among the k treatment groups. For such situ- 
ations, trend analysis is often an appropriate statistical technique. 
Comprehensive discussions of the use of trend analysis with par- 
ametric data are available in statistics textbooks commonly used by 
behavioral scientists (e.g., Ferguson, 1971; Hays, 1963; Kirk, 1968; 
Winer, 1962). Analogous procedures for ranked data have been 
presented by Ferguson (1965, 1971), Jonckheere (1954a, 1954b), 
Jonckheere and Bower (1967), May and Konkin (1970), and 
Page (1963). Those described by May and Konkin (1970), for 
testing ordered hypotheses for k independent samples, and Page 
(1963), for testing ordered hypotheses for k correlated samples, are 
accompanied by extensive tables and are computationally facile. 
However, if ties exist among the k sample values, and the researcher 
has access to a program like that described in this paper, then the 
tests presented by Ferguson (1965, 1971) are a viable alternative. 

The tests described by Ferguson employ the statistic S, which is
related to Kendall’s τ and offers many statistical advantages. For
example, as n increases, the sampling distribution of S rapidly approaches
the normal distribution. More precisely, for n > 10, the
normal approximations and exact values based on the distribution of 
S are extremely close (Kendall, 1962). 


1The author gratefully acknowledges the support for this research which 
was provided by a Faculty Research grant funded by Temple University. 




The program described in this paper is designed to perform a
monotonic trend test for either k independent samples (Kruskal-
Wallis model), or k correlated samples (Friedman model), using the
computational schema outlined by Ferguson (1965, 1971).


Input 
The job deck set-up for each analysis is as follows: 
Problem card 


   Column(s)   1   = Nonparametric model (1 = independent
                     samples; 2 = correlated samples)
               2-3 = Number of samples or experimental
                     conditions (k)


Format card


This F-type variable format card indicates the location of the
raw scores (or ranks) on the data cards. This format may be
punched in any of the columns on the card.


Sample card(s)


A card (or cards) indicating the size(s) of the sample(s). For
independent samples, the number of subjects in each sample is
punched on the card(s) using 26I3 format. These sample sizes are
punched in the same order as the hypothetical ranking of the independent
samples. For correlated samples, the number of subjects
in the sample (or matched samples) is punched on the card using
I3 format.


Data deck 


These cards, which contain the data for each sample (or experimental
condition), must be punched in accordance with the format
specified on the F-type variable format card. For independent samples,
[...] each sample beginning on a new card. Moreover, the data
must be arranged [...] matched subjects) must be punched
in the same order as the hypothetical ranking of the correlated
samples.


Last card 


If the user wishes to terminate the program, then the card im- 
mediately following the data deck must have the word FINISH 
punched in columns 1 to 6. However, if the user wishes to analyze 
another set of data, then this card is a blank one and the job deck is 
arranged sequentially (as described above) beginning with the 
problem card. 
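
The fixed-column layout of the control cards can be made concrete with a short reader. The Python sketch below is not the author's FORTRAN; it only mimics the column conventions stated above (model in column 1, k in columns 2-3, sample sizes in 26I3 fields, FINISH in columns 1 to 6). All function names are hypothetical.

```python
def parse_problem_card(card):
    """Column 1 = model (1 independent, 2 correlated); columns 2-3 = k."""
    model = int(card[0])
    k = int(card[1:3])
    return model, k


def parse_sample_cards(cards, k):
    """Read k sample sizes punched as 26I3: 26 three-column integers per card."""
    sizes = []
    for card in cards:
        card = card.ljust(78)                  # 26 fields x 3 columns
        for start in range(0, 78, 3):
            if len(sizes) < k:
                sizes.append(int(card[start:start + 3]))
    return sizes


def is_last_card(card):
    """FINISH in columns 1 to 6 ends the run; a blank card means another job follows."""
    return card[:6] == "FINISH"


model, k = parse_problem_card("1 3")
print(model, k, parse_sample_cards(["  5 10  7"], k))
```

With a maximum of 30 samples, the 26I3 format implies that a second sample card is needed whenever k exceeds 26, which is why the reader accepts a list of cards.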


Output 
The computer output for k independent samples includes (a) the
mean rank for each sample, (b) the value of S, and (c) the value
of the normal deviate z. Similarly, the computer output for k correlated
samples includes (a) the mean rank for each sample, (b) the
value of S, and (c) the value of the normal deviate z.
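
For the independent-samples case, the three output quantities can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Ferguson's computational schema: S is taken as concordant minus discordant pairs across ordered samples, the variance is the standard no-ties expression for the Jonckheere-type statistic, and no continuity or tie correction is applied. The names `average_ranks` and `trend_test` are illustrative.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values 1..N, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def trend_test(samples):
    """samples: lists of scores, given in the hypothesized increasing order."""
    pooled = [x for sample in samples for x in sample]
    ranks = average_ranks(pooled)
    mean_ranks, start = [], 0
    for sample in samples:
        mean_ranks.append(sum(ranks[start:start + len(sample)]) / len(sample))
        start += len(sample)
    # S: pairs ordered as predicted minus pairs ordered against prediction.
    s = sum((y > x) - (y < x)
            for a, b in combinations(samples, 2) for x in a for y in b)
    n = len(pooled)
    # No-ties variance (an assumption of this sketch).
    var_s = (n * n * (2 * n + 3)
             - sum(len(g) ** 2 * (2 * len(g) + 3) for g in samples)) / 18
    return mean_ranks, s, s / sqrt(var_s)


print(trend_test([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
```

When ties are present, the variance term requires the additional corrections that the tests of Ferguson (1965, 1971) provide, so the z from this sketch should be read as an approximation for untied data only.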


Capabilities and Limitations 
The program is written in FORTRAN IV for processing by com- 
puters in the IBM 360 (or the CDC 6000) series. It can handle a 
maximum of 30 samples (or experimental conditions) and 200 subjects
per sample (or experimental condition). Jobs may be run sequentially
as described above.


Availability 
Copies of this paper and a source listing which includes input 
and output data for sample problems can be obtained by writing 


to Dr. James J. Roberge, Temple University, Department of Educa- 
tional Psychology, Philadelphia, Pennsylvania 19122. 
CRONB: A FORTRAN IV PROGRAM TO COMPUTE
VARIANCE COMPONENTS FOR VARIOUS 
EXPERIMENTAL DESIGNS 


EDWIN T. CORNELIUS III
J. ARTHUR WOODWARD, AND 
ROBERT G. DEMAREE 


Institute of Behavioral Research 
Texas Christian University 


As Gleser, Cronbach, and Rajaratnam (1965), and earlier, Cronbach,
Rajaratnam, and Gleser (1963), have pointed out, the problem
of determining the “reliability” of a behavioral measure is better 
phrased in terms of a systematic appraisal of the multiple sources 
of variance that may affect the observed scores on a measure. Their 
approach to the problem, elaborated as “Generalizability Theory,” 
places more emphasis on estimating variance components and de- 
fining universes of generalization than do the “classical” theories 
of reliability. 

There are several advantages of Generalizability Theory over 
alternative methods of assessing the reliability of a set of scores. 
First, an analysis-of-variance approach to reliability forces the 
researcher not only to consider the several factors, or “facets,” of a 
measure, but also explicitly to consider the universe score to which he 
wishes to generalize. Second, a “multifacet” approach to reliability 
allows the researcher to appraise the effects of “interactions” among 
the sources of variance for his data. Third, the results of a multifacet 
data analysis enable the investigator to design more efficient pro- 
cedures for collecting data. That is, the results from an initial study, 
the “Generalizability (G)-study,” may indicate that a particular 
facet should be excluded from consideration, or that several facets 
should be combined, or sampled jointly, to increase the “reliability” 
of the measure in a “Decision (D)-study.” Finally, the results 
from one multi-facet study serve to answer questions concerning the
data that previously took several reliability studies to assess. 

In a recent publication (Cronbach, Gleser, Nanda, and Rajar- 
atnam, 1971), the authors have developed the theory in detail 
through systematically examining multiple sources of variance in a 
set of scores from various nested and fully-crossed experimental de- 
signs, for both the random and mixed-model cases. As presented in 
this text, the notion of the variance component, which is basic to the 
theory, is inextricably involved in computing “coefficients of gen- 
eralizability,” universe scores, confidence intervals about universe 
scores, and types of “error” that occur in behavioral data. 

The present paper describes a series of computer subroutines that 
will perform variance analyses for the eight experimental designs 
that are presented in the Cronbach et al. (1971) text. The series of 
subroutines, which was written in FORTRAN IV for a fairly small 
computer, IBM 1800, 32K, can be easily modified for any computer 
system. 


Description of the Program 


The package of programs is presently set up to operate as follows: 
CRONB (the mainline program) reads in the raw data from the 
experiment. CRONB then calls SUMSQ, which computes sums of 
squares for each source of variance as if it were a fully crossed 
one-facet or two-facet study (corresponding to a two-way or three- 
way ANOVA, respectively). The sums of squares are then passed to 
one of the eight “design” subroutines which recombines the sums of 
squares and degrees of freedom according to the particular experi- 
mental design of the user. Variance components are then computed for 
each source of variance. After each component is calculated, subroutine
CHCK tests whether the component is greater than zero. If
the component has a negative value, a zero is substituted for that 
component, and calculations are continued. 
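
For the simplest case, a fully crossed one-facet study (persons by conditions, one score per cell), the sequence of sums of squares, mean squares, variance components, and the zero substitution can be sketched in Python as below. This is not the CRONB source; it uses the usual random-model expected mean squares, and the function and key names are hypothetical.

```python
def variance_components(scores):
    """scores[p][i]: one score for person p under condition i (fully crossed)."""
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in scores]
    i_means = [sum(col) / n_p for col in zip(*scores)]

    # Sums of squares as in a two-way ANOVA without replication.
    ss_p = n_i * sum((m - grand) ** 2 for m in p_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in i_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    def check(component):
        """Substitute zero for a negative estimate, as CHCK is described as doing."""
        return max(component, 0.0)

    return {
        "persons": check((ms_p - ms_res) / n_i),
        "conditions": check((ms_i - ms_res) / n_p),
        "residual": check(ms_res),
    }


print(variance_components([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
print(variance_components([[1, 2], [2, 1]]))   # negative estimates are set to zero
```

Setting a negative estimate to zero keeps later quantities such as coefficients of generalizability within bounds, at the cost of a small upward bias in the affected component.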


Input 
Input is from cards, tape, or disk. The input format is contained 
in a READ subroutine, which may be altered by the user to fit the 
characteristics of his data. The user specifies on a control card a
title, the number of facets, the name of each facet, and the number
of levels of each facet.


Output 


Output on the printer includes tables of means and joint-means, 
as well as the sums of squares, degrees of freedom, mean squares, 
and variance components for each source of variance. 


Availability of the Program 


Extensive documentation, as well as a listing of the program in- 
cluding sample input and output, is available at cost in care of the 
Systems Programmer, Institute of Behavioral Research, TCU Sta- 
tion, Fort Worth, Texas, 76129. 
A COMPUTER PROGRAM FOR INVESTIGATING THE
EFFECTS OF SAMPLING ERROR ON MULTIPLE 
REGRESSION EQUATIONS 


JAMES T. BOLDING 
University of Arkansas 


The program discussed in this paper generates pseudo random
numbers from a normal population and uses them to construct regression
models. Such models may be used to demonstrate and
study the effects of sampling error on the size of the multiple R and
on the values of the raw weight coefficients. The effects depend upon 
the size of the sample and the percentage of criterion variance con- 
tributed by the predictors selected for the model. 

The target mean and standard deviation for the pseudo random 
numbers may be controlled by the user. All generated numbers 
are from the same population on a specific run of the program. A set
of 10 scores is generated for each of the subjects of the sample. For
each subject, the sum of the 10 scores is computed and used as the
criterion score. A regression model is constructed using a subset of
the 10 contributing scores to predict the criterion. Since the 10 
contributing variables have the same target variance and are pair- 
wise independent, each variable is expected to contribute ten per 
cent of the variance of the criterion. The expected R-square for a 
given model is thus the number of predictors selected times 10 
per cent. Because the criterion is generated as a linear combination 
of the set of 10 predictors, R-square is equal to 1.00 for any model 
containing all 10 contributors.
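
The design of one such model can be sketched in a few lines. The Python version below is only an illustration of the idea: it substitutes Python's `random.gauss` for RANDU and GAUSS, solves the normal equations by plain Gaussian elimination rather than the modified forward Doolittle method, and uses hypothetical names and default parameters (mean 50, standard deviation 10) throughout.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def sample_r_squared(n_subjects, n_predictors, mean=50.0, sd=10.0, rng=random):
    """R-square when the sum of 10 normal scores is predicted from a subset of them."""
    scores = [[rng.gauss(mean, sd) for _ in range(10)] for _ in range(n_subjects)]
    y = [sum(row) for row in scores]                 # criterion = sum of 10 scores
    x = [row[:n_predictors] for row in scores]       # predictors = first m scores
    x_bar = [sum(col) / n_subjects for col in zip(*x)]
    y_bar = sum(y) / n_subjects
    xc = [[v - mu for v, mu in zip(row, x_bar)] for row in x]
    yc = [v - y_bar for v in y]
    p = n_predictors
    xtx = [[sum(r[i] * r[j] for r in xc) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(xc, yc)) for i in range(p)]
    weights = solve(xtx, xty)                        # raw regression weights
    ss_regression = sum(w * v for w, v in zip(weights, xty))
    return ss_regression / sum(t * t for t in yc)


rng = random.Random(1972)
for m in (1, 5, 10):
    print(m, "predictors, expected", m / 10, "observed",
          round(sample_r_squared(300, m, rng=rng), 2))
```

Repeating the call for smaller samples shows the observed R-square scattering more widely around the expected value, which is the sampling-error effect the program is meant to demonstrate.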

The demonstration is set up to construct a total of 600 equations. 
The number of predictors ranges from one through 10. The number
of subjects ranges from 50 through 300, increasing by 50 on each
change. For each combination of a number of subjects and a number
of predictors, 10 different samples are generated and the corresponding
ten models determined.

A modification of the forward Doolittle method is used for the 
computation. The pseudo-random numbers are generated by the sub- 
routines RANDU and GAUSS from the IBM Scientific Subroutine 
Package. A table of 1000 numbers is constructed and a random 
integer selects a number from that table. The program is set up for 
double precision computation in the FORTRAN language. 

A copy of the listing may be obtained from the writer at College 
of Education, University of Arkansas, Fayetteville, Arkansas,
72701.
A COMPUTER PROGRAM FOR TWO-WAY FIXED
EFFECTS ANALYSIS OF VARIANCE WITH 
DISPROPORTIONATE CELL FREQUENCIES 


JOHN D. WILLIAMS AND ALFRED C. LINDEM
The University of North Dakota 


The solution to the disproportionate case of the two-way fixed
effects analysis of variance is complicated by the existence of more 
than one solution, the different solutions being dependent upon the 
assumptions of the researcher. Some of these solutions are approxi- 
mate, including discarding data, estimating missing data, the 
method of unweighted means, the method of expected cell frequencies,
and the method of weighted means. Often an exact, or least
squares, solution is alluded to; there is, however, more than one
least squares solution.

The present program allows for the selection of any (or all) of the 
following least squares solutions: (a) the method of fitting constants, 
a commonly accepted solution, described in Scheffé (1959) and Anderson
and Bancroft (1952), a method that adjusts each main effect
for the other main effect; (b) the hierarchical model (Cohen, 1968), 
which allows for one effect to take precedence over the second ef- 
fect; the first main effect is unadjusted, and the second main effect 
is adjusted for the first main effect; and (c) the unadjusted main 
effects method, in which neither main effect is adjusted for the other 
main effect. In all three methods, the interaction effect is adjusted 
for the two main effects. The three least squares methods and the 
previously mentioned approximate solutions are compared by Wil- 
liams (in press). 


Input 
Data cards contain, for each observation, the criterion (or criteria,
if several different problems are to be solved), the row membership,
the column membership, and the cell membership. The cell membership
is found by counting the cells in the first row, then the second row, 
etc., until the particular cell is found that contains the observation 
for a given card. Parameter cards specify problem identification, 
number of criteria, number of observations, number of rows, num- 
ber of columns, number of cells, and the type(s) of solutions desired. 
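
The counting rule for cell membership amounts to numbering the cells row by row. A minimal sketch, assuming rows and columns are numbered from 1 (the function name `cell_membership` is hypothetical and not part of the program):

```python
def cell_membership(row, col, n_cols):
    """Cell number obtained by counting across the first row, then the second, etc."""
    if row < 1 or not 1 <= col <= n_cols:
        raise ValueError("row and column must be positive and within the design")
    return (row - 1) * n_cols + col


# In a design with 4 columns, an observation in row 2, column 3 falls in cell 7.
example_cell = cell_membership(row=2, col=3, n_cols=4)
```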


Limitations 
The maximum dimensions are as follows: 
99,999 observations and 30 cells. 


Computer and Program Language 


The program is written in FORTRAN IV level F for the IBM 
360 (128k). 


Output 
A complete analysis of variance table is printed for each solution
requested, including the sources of variation, the degrees of freedom,
the sums of squares, the mean squares, and the F values.


A printout of the program and sample output will be supplied on
request.
A COMPUTER PROGRAM FOR CALCULATING THE
POWER OF F TESTS IN ANALYSIS OF VARIANCE
AND COVARIANCE FOR SPECIFIED ALPHA
LEVELS, SAMPLE SIZES, AND EFFECT SIZES¹


ROBERT S. BARCIKOWSKI AND NORMAN HOLTHOUSE
Ohio University


The use of experimental designs that entail small sample sizes
often leads to the application of statistical tests that have very low 
power. That is, there is only a small probability of detecting significant
differences among the treatment levels of the design. The researcher 
in the behavioral sciences who is concerned with type II errors will 
find the program discussed in this paper a useful aid in calculating 
the power values associated with appropriate F tests in analysis of 
variance and covariance designs. The program is also useful in help- 
ing the researcher select the appropriate significance level (alpha), 
effect size, and sample size for his experimental design. It does so
through several options which allow the researcher to print a table of
power values for a specified value of alpha over a range of
sample and effect sizes.


Input 


The method used to generate the power values in this program is 
based upon Laubscher’s square root normal approximation of non- 
central F (Laubscher, 1960). The program contains three options 
which allow great flexibility in printing power values. 

¹ The term “effect size” is used by Cohen (1969). It is used here as a
measure of the treatment effects in an analysis of variance design.
Problem Card

Column 3 = Number 1, 2, or 3 to call options 1, 2, or 3.


Option 1


Problem Card
Columns 1-5 = Degrees of freedom for the numerator of the F test used.
       6-10 = Degrees of freedom for the denominator of
              the F test used.
      11-15 = Number of elements (subjects) in a level of a
              factor (group or cell).
      16-20 = Effect size (f),


$$f = \frac{1}{\sigma} \sqrt{\frac{\sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2}{N}}$$

where $\sigma$ = the population standard deviation.

$k$ = the number of levels of a factor.

$n_i$ = the sample size in the ith level (i = 1, 2, ..., k).

$\bar{X}_i$ = the mean of the observations in the ith level.

$N$ = the total number of observations in the design (i.e.,
$N = \sum_{i=1}^{k} n_i$).

$\bar{X}$ = the mean of the total number of observations in the
design.



21-25 = Alpha, level of significance for the F test. 
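
For readers who wish to prepare the effect size entry by hand, the definition of f given above can be evaluated as in the following sketch. The function name `effect_size_f` and the example values are illustrative assumptions; the code follows only the definitions of the symbols listed above and is not taken from the program.

```python
import math


def effect_size_f(means, sizes, sigma):
    """Effect size f: root weighted mean squared deviation of level means, over sigma."""
    n_total = sum(sizes)
    # Mean of all observations in the design (weighted mean of the level means).
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    weighted_ss = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    return math.sqrt(weighted_ss / n_total) / sigma


# Three levels with 10 subjects each, level means 1, 2, 3, and sigma = 1.
f_example = effect_size_f(means=[1.0, 2.0, 3.0], sizes=[10, 10, 10], sigma=1.0)
```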
Option 2 


Problem Card 1
Columns 1-15 = Factor name, alphanumeric.
       16-20 = Degrees of freedom for the numerator of the
               F test.
       21-25 = Product of the levels of each factor in the
               design.
       26-30 = Alpha, level of significance for the F test.
Problem Card 2
Columns 1-2 = Number of sample sizes to be read in, limit of 
50. 
Columns 3-7 = First sample size. 
8-12 = Second sample size, etc., up to 50. 
Problem Card 3 
Columns 1-2 = Number of effect sizes for which power values 
are to be calculated, limit of 18 per test. 
3-5 = Value of first effect size. 
6-8 = Value of second effect size, etc., up to 18. 


Option 3 


Problem Card 1 
Columns 1-5 = Alpha, level of significance for F test. 
6-7 = Number of factors in the design. 
15 = Alphanumeric label for first factor. 
16-19 = Number of levels for the first factor. 
20 = Alphanumeric label for the second factor. 
21-24 = Number of levels for the second factor, etc., 
for all factors, limit of ten factors. 


Problem Card 2—Same as in Option 2. 
Problem Card 3—Same as in Option 2. 


Output 
Option 1 
The output from option one includes the following:


1. degrees of freedom specified,
2. alpha level specified,
3. effect size specified,
4. sample size specified,
5. critical value of F calculated,
6. power of the F test calculated.


Option 2 
The output from option two includes a full table of power values
for the range of sample sizes and effect sizes specified. Appropriate
headings listing degrees of freedom, critical values of the F statistic,
alpha level, sample size, and effect size are provided in the output
table.


Option 3 
The output from option three is identical to that supplied by option
two for each unique source of variation in a fixed effects analysis
of variance design.


Capabilities and Limitations 
The program is written in FORTRAN IV. Compile and load time on
Ohio University’s IBM 360 Model 44 with the G-level compiler was
approximately 25 seconds. Execution time for option 1 was approximately
six seconds. Execution time for a single power table in
options 2 and 3 with 50 sample sizes and 18 effect sizes was approximately
350 seconds. Jobs may be stacked so that each option may
be run either several times or in sequence with any other option.


Availability 
Copies of a source listing of this program with example output can
be obtained by writing to Dr. Robert S. Barcikowski, Ohio University,
Department of Educational Research, Statistics, and Evaluation,
McCracken Hall, Athens, Ohio 45701.
INDIVIDUALIZED TAKE-HOME TEST ON ANOVA
AND t TESTING


WILLIAM R. KENNEDY AND PAUL A. GAMES
The Pennsylvania State University


This program generates individual data sets for students to
take home as an examination or an exercise. The student is given 
a data sheet and an answer sheet. The instructor gets an answer
sheet for each student filled in with the correct answers. Since 
each student has his own unique set of data, the program allows 
the instructor to eliminate speed from his testing and removes the 
risk of copying from take-home exams. 

Description of Program 

The program has two major options: one-way analysis of variance
or t tests on independent groups (subjects nested under treatments).
The analysis of variance (ANOVA) option generates up
to 10 treatment groups; the t test option has only two independent
treatment groups. Unequal n’s are allowed in both options,
up to a maximum n of 20.

The program computes and prints out the sums of squares,
means, and unbiased sample variances for each treatment. With
equal n’s, under the ANOVA option, conventional multiple comparison
tests (Fisher’s Least Significant Difference, Tukey’s
Wholly Significant Difference (WSD), and the Newman-Keuls;
Games, 1971) can be run by the student. If unequal n’s are specified,
these comparisons are bypassed. The ANOVA option
prints out the basic computational steps needed for the omnibus
F test and a summary table to be filled in by the student. There
is also an a priori t test run between the first and last treatment
groups using mean square within cells. The FMAX test for homogeneity
of variance is run. The t test option does both a regular
t and the Behrens-Fisher statistic with Welch’s solution for the critical
value from the t distribution (Kohr, 1970; Welch, 1947; Winer,
1962, p. 37). Confidence intervals are computed after these two tests.
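
The separate-variance statistic mentioned above can be illustrated with the approximation to the degrees of freedom that is commonly associated with Welch's solution. The sketch below is an assumption about the general form of the computation, not a transcription of the program; the function name `welch_t` is hypothetical.

```python
import math


def welch_t(mean1, var1, n1, mean2, var2, n2):
    """Separate-variance t statistic and approximate degrees of freedom."""
    se1 = var1 / n1  # squared standard error of the first mean
    se2 = var2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# With equal variances and equal n, the approximate df reduces to n1 + n2 - 2.
t_value, df_value = welch_t(5.0, 4.0, 10, 3.0, 4.0, 10)
```

The critical value is then read from the t distribution at the (generally fractional) degrees of freedom returned.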

The computer prints an ANOVA student answer sheet, a t test
answer sheet, or no answer sheet at all. If the latter choice is
made the program still prints out the answers for the instructor 
in answer sheet format. The output data may be decimal or in- 
teger. The program generates data from treatment populations 
(the parameters of which are input by the instructor) through 
use of a random number generator. The random number gen- 
erator and two subprograms (for finding probabilities or critical 
values from the F distribution) from the Pennsylvania State Uni- 
versity library are contained in separate subroutines for easy sub- 
stitution of local system generators and subprograms without nec- 
essitating a change in the main program. 
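
The generation of a unique data set per student from instructor-supplied population parameters might be sketched as below. The seeding scheme, the function names `make_data_set` and `answer_key`, and the use of Python's standard generator in place of the Pennsylvania State University library routines are all assumptions made for illustration.

```python
import random
import statistics


def make_data_set(student_id, means, sds, sizes, decimals=0, base_seed=1972):
    """Draw one sample per treatment group; the same student always gets the same data."""
    rng = random.Random(f"{base_seed}-{student_id}")
    groups = []
    for mu, sd, n in zip(means, sds, sizes):
        scores = []
        for _ in range(n):
            value = round(rng.gauss(mu, sd), decimals)
            scores.append(int(value) if decimals == 0 else value)
        groups.append(scores)
    return groups


def answer_key(groups):
    """Per-treatment n, mean, and unbiased sample variance for the instructor."""
    return [
        {"n": len(g), "mean": statistics.mean(g), "variance": statistics.variance(g)}
        for g in groups
    ]


data = make_data_set("student-01", means=[50, 55, 60], sds=[10, 10, 10], sizes=[5, 5, 5])
key = answer_key(data)
```

Keeping the generator behind one function mirrors the point made above: a local generator can be substituted without changing the rest of the program.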

The instructor must input treatment population means, stand- 
ard deviations, sample sizes, studentized range critical values for 
the WSD or Newman-Keuls test (Winer, 1962, p. 648), descrip- 
tion cards (up to 25), student name cards, and critical values for 
the F and regular t statistics (to agree with tables in the student’s
textbook). The student usually receives two sheets. The first sheet 
contains the student’s name, social security number, title, experimental
description, and treatment data. The optional answer sheet repeats
the name and number, lists the desired level of significance, and
supplies the appropriate requests with underlined blanks for the 
student to fill in. The instructor obtains the same answer sheet 
as the student but with the answers filled in. The program, which 
was written in FORTRAN IV, uses approximately 46,000 bytes. 
A COMPUTER PROGRAM FOR FACTOR ANALYZING
UP TO 450 VARIABLES (MAXVAR)


EDGAR HOWARTH AND PETER H. BRAUN
University of Alberta, Edmonton, Canada 


Because of the complexities of calculations, factor analysts, 
with one or two exceptions (e.g., Horst, 1966; Parker and Veld- 
man, 1969; Sells, Demaree, and Will, 1970) have until recently 
confined their work to relatively small data matrices. Hence, few 
factor analytic programs are available which will conveniently 
handle in excess of 100 variables. Moreover, large scale factoring 
in the range 200 to 450 variables (or greater) has been extremely 
difficult and time consuming. 

However, since most comprehensive psychological inventories 
have in excess of 100 items, there is an urgent need for fast, ac- 
curate, large-scale programs. The desire to factor analyze such 
inventories by items has led to the present development of a
factor analytic program which will handle up to 450 variables. It 
has been named MAXVAR. 


Technical description: 


MAXVAR is presently divided into three independent com- 
ponent parts: 


1. Correlation routine. Input = raw scores, output = correla- 
tion matrix based on Pearson’s product moment formula. 


— 

1 Horst (1966) factored 288 items (population of only 211 subjects) using 
principal components followed by a varimax solution. This project took 
several hours but represented a considerable achievement at the time. Since 
then, Parker and Veldman (1969) factored 300 adjective check list (ACL) 
items on a population of 5017 subjects, and Sells, Demaree and Will (1970) 
have recently factored 600 items on a population of 2011 subjects. The latter 
investigators commented that “the sheer magnitude of the computations in- 
volved tested the limits of the computers used.” 
2. Factoring routine. Input = correlation matrix, output =
principal axis factor matrix based on a Householder-Wilkinson
algorithm (Howarth, 1971; Wilkinson, 1960).

3. Rotation routine. Input = principal axis or other factor
matrix, output = normalized varimax rotated factor matrix
(Kaiser, 1958) with columns re-ordered according to
their variance contribution.


The total package requires three control cards to pass necessary
information, such as title, parameters, and format of input
data, on to the programs. The programs, which are written in
FORTRAN IV H, are presently set up to operate under the IBM
OS release 19 on a Model 360/67 at the University of Alberta.
MAXVAR requires 500K of user available memory and, additionally,
either disc or tape space for storage of intermediate
outputs.

The first study using MAXVAR involved 401 variables. It required
a total execution time of 24.95 to 29.22 minutes, according
to sample size analyzed, on the above machine (exclusive of the
card to tape procedure which preceded data processing). About
half this time is taken up by the principal components solution.

A listing of MAXVAR, along with control card instructions, 
and source decks (if required) may be obtained from Peter H. 
Braun, Department of Educational Psychology, University of Alberta,
Edmonton, Alberta.

It is realized, of course, that the large core requirement
may prevent some researchers who lack appropriate facilities
from taking advantage of MAXVAR. Hence, the authors wish to
advise that recent changes in administrative policy at the University
of Alberta now make it possible for outside parties to
purchase time on this university’s large scale computing facility.
Inquiries for the details of such arrangements may be directed
to the Computing Center (Director: Dr. Dale H. Bent).
A FORTRAN PROGRAM FOR INVERTING A POSITIVE
DEFINITE MATRIX¹


HENRY F. KAISER
University of California, Berkeley


KERN W. DICKMAN
University of Illinois, Urbana


Since publishing, in this journal, our paper “Program for Inverting
a Gramian Matrix” (Dickman and Kaiser, 1961), we have
had numerous requests for the actual program. This note places
on record a FORTRAN version of the program (omitting input-output
statements).

Input the real symmetric positive definite matrix A = [A(I,J)]
of order N. Space for an N-vector T = [T(I)] is also needed.


      NLESS1 = N - 1
C     ONE PIVOT STEP PER PASS; EACH PASS SHIFTS THE MATRIX UP AND LEFT
      DO 100 K = 1, N
C     A NON-POSITIVE OR VERY SMALL PIVOT MEANS A IS NOT POSITIVE DEFINITE
      IF (A(1,1) - 0.000001) 101, 101, 102
  102 X = SQRT(A(1,1))
      DO 103 I = 1, NLESS1
  103 T(I) = A(I+1,1)/X
      T(N) = 1.0/X
      DO 104 J = 1, NLESS1
      DO 104 I = 1, NLESS1
  104 A(I,J) = A(I+1,J+1) - T(I)*T(J)
      DO 105 I = 1, N
  105 A(I,N) = -T(I)*T(N)
      DO 100 J = 1, NLESS1
  100 A(N,J) = A(J,N)
C     AFTER N PASSES A HOLDS THE NEGATIVE OF THE INVERSE
      DO 106 J = 1, N
      DO 106 I = 1, N
  106 A(I,J) = -A(I,J)
      Output A = [A(I,J)] which is now the inverse of the original A
      GO TO 107
  101 Output “THE MATRIX IS SINGULAR, VERY NEARLY SINGULAR, OR INDEFINITE”
  107 END


¹ The research reported in this note was supported in part by the Committee
on Basic Research in Education (Patrick Suppes, chairman), U.S. Office of
Education.
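
For readers without a FORTRAN compiler, the same steps can be followed in the rough Python transcription below. It is a sketch under stated assumptions: indices are shifted to start at zero, the input is copied rather than overwritten, and the function name `invert_positive_definite` and the use of an exception for the error exit are choices made here, not part of the published program.

```python
import math


def invert_positive_definite(a, tol=1e-6):
    """Invert a real symmetric positive definite matrix given as a list of rows."""
    n = len(a)
    a = [row[:] for row in a]  # work on a copy
    t = [0.0] * n
    for _ in range(n):
        if a[0][0] <= tol:
            raise ValueError("matrix is singular, very nearly singular, or indefinite")
        x = math.sqrt(a[0][0])
        for i in range(n - 1):
            t[i] = a[i + 1][0] / x
        t[n - 1] = 1.0 / x
        # Shift the reduced matrix up and to the left (statement 104).
        for j in range(n - 1):
            for i in range(n - 1):
                a[i][j] = a[i + 1][j + 1] - t[i] * t[j]
        # Fill the last column and mirror it into the last row (105 and 100).
        for i in range(n):
            a[i][n - 1] = -t[i] * t[n - 1]
        for j in range(n - 1):
            a[n - 1][j] = a[j][n - 1]
    # After n passes the array holds the negative of the inverse (106).
    return [[-v for v in row] for row in a]


inverse = invert_positive_definite([[4.0, 2.0], [2.0, 3.0]])
```

A convenient check is to multiply the result by the original matrix and confirm that the product is close to the identity.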
AN ITEM ANALYSIS PACKAGE FOR LIKERT SCALES¹


GEORGE W. BOHRNSTEDT
University of Minnesota


RICHARD T. CAMPBELL²
University of Wisconsin


ITEMPACK is a program that allows the user to build scale
scores from a set of items. It also calculates the reliability of each
score, the characteristics of the various items, and the intercorrelations
among the various scores. Specifically, the program provides
the following output:

1. Means, standard deviations, and distributions of each of the
items and each of the scores.
2. Intercorrelations of all the input items. 
3. Intercorrelations of the scale scores. 
4. Item-to-total score correlations. 
5. Corrected item-to-total score correlations using the method 
suggested by Cureton (1966). 

6. Cronbach's coefficient alpha as a measure of the reliability 
of the scale score (Cronbach, 1957). 

7. The standard error of measurement for each score. 


The program also provides the following options: 
1. Punched output of total scores for each subject. 
2. Differential integer weighting of input items. 
3. Replacement of missing data codes with pre-determined values. 
4. Addition of a constant to total scores. 

¹ We would like to thank Thomas Heberlein for numerous comments and
suggestions which greatly improved the final version of the program.
² Support for this project was provided by the National Institute of General
Medical Sciences, Grant Number GMO-1526.
The program will handle up to 125 items from which a maximum
of 50 scores can be built. The maximum number of items in any 
one score is 50. 


Analytic Techniques 


ITEMPACK uses the algorithms suggested by Bohrnstedt
(1969). After creating total scores, the program creates a variance/covariance
matrix containing each item and each scale score.
Let $s_{ij}$ be the sample covariance of the ith with the jth item and
$s_i^2$ be the variance of the ith item. Then if k items are summed
into a composite X, the variance of X is:


$$s_X^2 = \sum_{i=1}^{k} s_i^2 + 2 \sum_{i<j} s_{ij} \qquad (1)$$


The correlation of item i with X, $r_{iX}$, uncorrected for the inclusion
of item i in X, is given by


$$r_{iX} = \frac{\sum_{j=1}^{k} s_{ij}}{s_i s_X}, \quad \text{with } s_{ii} = s_i^2. \qquad (2)$$


A variety of methods exists for calculating an item-to-total
score correlation which takes into account the spurious result
caused by the inclusion of item i in the correlation $r_{iX}$ where i
contributes to X. Cureton (1966) pointed out that the reliability
of the total score, X, with item i missing varies inversely with the
reliability of the item itself. He suggested the replacement of the
item with a rationally equivalent (parallel) item which would
thereby leave the total score reliability unchanged. He demonstrated
further that if the score is factorially homogeneous, then
one need not actually replace the item. Let $r_{iX}$ be the uncorrected
item-to-total correlation, $S_i = s_i/s_X$, and $r_{XX}$ = the reliability of
the total score. Then the corrected item-to-total correlation $r'_{iX}$
for the ith item is:


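[Equation (3), giving $r'_{iX}$ in terms of $r_{iX}$, $S_i$, and $r_{XX}$, is not recoverable from the scanned source.]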


In certain cases the term under the radical will become negative 
in which case the following approximation for (3) is used: 


[Equation (4), the approximation to (3), is not recoverable from the scanned source.]
When this occurs, an appropriate warning is printed. Finally, the
coefficient of internal consistency, alpha, is given by:


$$\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} s_i^2}{s_X^2} \right) \qquad (5)$$


where k is the number of items, and the standard error of measurement
is


$$s_e = s_X \sqrt{1 - r_{XX}} \qquad (6)$$

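
The quantities that can be computed directly from the variance/covariance matrix (the composite variance, the uncorrected item-to-total correlations, coefficient alpha, and the standard error of measurement) are illustrated in the sketch below. It is not the ITEMPACK code; the function names are hypothetical, alpha is used as the reliability estimate in the standard error of measurement as an assumption, and the Cureton correction is omitted.

```python
import math


def covariance(x, y):
    """Unbiased sample covariance of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)


def item_analysis(items):
    """items: one list of scores per item, all over the same subjects."""
    k = len(items)
    cov = [[covariance(items[i], items[j]) for j in range(k)] for i in range(k)]
    sum_item_var = sum(cov[i][i] for i in range(k))
    # Variance of the composite: item variances plus twice the covariances.
    var_total = sum_item_var + 2 * sum(
        cov[i][j] for i in range(k) for j in range(i + 1, k)
    )
    s_total = math.sqrt(var_total)
    # Uncorrected item-to-total correlations (item i is still part of X).
    r_item_total = [sum(cov[i]) / (math.sqrt(cov[i][i]) * s_total) for i in range(k)]
    alpha = k / (k - 1) * (1 - sum_item_var / var_total)
    sem = s_total * math.sqrt(1 - alpha)
    return {"var_total": var_total, "r_item_total": r_item_total,
            "alpha": alpha, "sem": sem}


result = item_analysis([[1, 2, 3, 4, 5], [2, 2, 3, 5, 5], [1, 3, 3, 4, 4]])
```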
Programming Techniques 


The program is available in FORTRAN IV and FORTRAN
V. A minimum of 32K of core is needed, although manipulations
of DIMENSION statements allow for great flexibility in this re- 
gard. The program reads the control cards one by one and checks 
each for errors. Control card checking continues through the en- 
tire control card set even if errors are discovered in a given card. At 
the end of control card processing, the analysis phase is ini- 
tiated if no errors have been discovered; otherwise the program 
aborts. An end of file is discovered either by the system or by the 
program’s recognition of a user supplied “last ID.” This feature
permits the checking of control cards using only the first
m observations of a data set. Multiple data sets may be processed
in the same run. Input may be from tape or card, but output
scores are on card only.


Availability
Program decks, documentation, and test data are available
from the second author.
A PROGRAM TO CARRY OUT CLUSTER ANALYSIS
BY HOMOGENEOUS GROUPING 


RAM K. GUPTA 
University of Alberta 


J. DALE BURNETT 


McArthur College of Education 
Queens University, Kingston, Ontario, Canada 


A general problem confronting many researchers involves the 
identification of subtests from a large pool of items such that 


1. each subtest has maximum homogeneity (i.e., maximum KR-20,
alpha coefficient, or Hoyt’s reliability), and

2. the various subtests have maximum independence (i.e., minimum
intercorrelations), making it possible to provide maximum
discrimination on each trait or characteristic.

This problem may be handled through factor analysis, which 
has certain disadvantages, including the following: 

1. The form of the results. Typically, one obtains a loading or
weight for each item on each factor. Thus it is necessary to modify
the weights so that they are arbitrarily made either 1 or 0 (an
item is either included in a subtest or excluded from it).

2. The use of factor analysis requires a level of statistical so- 
phistication well beyond the reach of an ordinary researcher. 

3. Objective guidance is not available for deciding upon the 
number of factors one should extract in a given situation. 

4. The researcher does not know the reliability of the resulting 
factors so that he runs the risk of extracting too many weak fac- 
tors—those having very low reliabilities. 

A procedure whose rationale is more directly associated with 
the principle and technique of test development and which can 
be easily understood by the practitioner would thus appear to
have considerable merit. Cluster analysis is such an alternative. 
Of the various procedures of cluster analysis, the one presented 
by DuBois, Loevinger, and Gleser (1952), Loevinger, Gleser, and 
DuBois (1953), and Loevinger (1947, 1948), is perhaps the sim- 
plest and intuitively the most appealing. It can be easily used 
on a desk calculator even for relatively large matrices and gives 
results which are often very similar to those obtained from factor 
analysis (Gupta, 1968). 

Intuitively speaking, the process for constructing homogeneous 
subtests is as follows: Given a finite pool of items, select the 
nucleus of three items which are maximally correlated with one 
another and, therefore, give the highest discriminating power. 
Add to this nucleus, a fourth item which is the most closely re- 
lated to the nucleus in the sense of giving the highest discrim- 
ination. Similarly, add the fifth, the sixth, and subsequent items 
until a stage is reached when the further addition of an item 
either fails to contribute appreciably to the saturation coefficient,
S, of the cluster or actually starts lowering it. This signals the
approach of the point at which the process of extracting the first 
cluster should be terminated. S is given by equation (1). 


$$S = \frac{2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} C_{ij}}{\sum_{i=1}^{n} V_i + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} C_{ij}} \qquad (1)$$

where $i$ = subscript for the items extracted from the item pool,
and renumbered from 1 to n,
$V_i$ = the variance of item i,
$C_{ij}$ = the covariance of item i with item j,
$n$ = the number of items in a cluster at a given stage in an
analysis.
The algorithm is repeated to form the next cluster with the new 
pool of items consisting of those items not yet included in a
cluster. This process of extracting successive clusters is continued 


¹ Discriminating power is defined as the ratio of the sum of the covariances
of n items to the sum of the variances of the same n items. This term is referred
to as the “maximizing ratio” by Loevinger et al. (1953).
until every item is either included in one of the clusters or has failed
to align itself in forming a cluster having a desired coefficient of
reliability. This intuitive approach can be guided by two quantitative
criteria: (a) maximizing S or (b) maximizing homogeneity as given by
equation (2).


$$\text{KR-20} = \frac{n}{n-1} \, (S) \qquad (2)$$


Homogeneity is merely a monotonic function of S, but is far better
known, making it the preferable option. However, the program
presented here offers both options.
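
The relation among the maximizing ratio, the saturation coefficient S, and KR-20 can be checked numerically with the short sketch below. The function names are hypothetical; the inputs are the sum of the covariances (each pair of items counted once) and the sum of the variances, and the trial values are those reported for the first three-item nucleus in the example printout in this note.

```python
def maximizing_ratio(sum_cov, sum_var):
    """Discriminating power W: sum of covariances over sum of variances."""
    return sum_cov / sum_var


def saturation(sum_cov, sum_var):
    """Saturation coefficient S: twice the covariances over the cluster variance."""
    return 2 * sum_cov / (sum_var + 2 * sum_cov)


def kr20(n_items, sum_cov, sum_var):
    """Homogeneity: KR-20 = n / (n - 1) * S."""
    return n_items / (n_items - 1) * saturation(sum_cov, sum_var)


w = maximizing_ratio(2.32987, 5.56507)
homogeneity = kr20(3, 2.32987, 5.56507)
```

Rounded to five places, these give W = 0.41866 and KR-20 = 0.68359, in agreement with the printout.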

At each stage, the best item is identified and added to the ones 
already in the cluster and, at the same time, the items which are 
likely to introduce heterogeneity are discarded on the basis of an 
objective criterion. This device guards against introducing “functional
drift” and contributes to making the technique elegant,
rigorous, and straightforward.

As an item is added, it is easy to calculate its contribution to 
the homogeneity of clusters. This enables the researcher to make 
decisions at each stage—a facility not afforded by factor analysis. 

The quantitative criteria can be used as a valuable adjunct to the
judgment of the researcher and not as the binding consideration. 
The clusters so developed can thus have “content” as well as 
“construct” validities. 

Researchers such as Ericksen (1962), Gee and Clark (1956),
Strommen (1963), and Strommen and Gupta (1971) have used the
technique to good advantage. The present write-up is intended
to arouse the curiosity of other researchers as well.


Program 


A copy of the program written by Joseph A. Klock for the 
Regents of the University System of Georgia was modified for 
use at the University of Alberta on an IBM 360/67. This program
provides a choice between the two criteria given above for terminating
a cluster.

When the numerical value of the selected criterion is decreased
by a certain pre-specified amount or tolerance level, the cluster
being formed is terminated and a new one started. The user is
free to specify arbitrarily the size (for example, .001) of the
tolerance level. Using zero as the tolerance level may terminate a
cluster prematurely because of rounding errors and/or small
chance fluctuations.

Another feature of the program relates to the item pool available
for the formation of a new cluster. According to Loevinger
et al. (1953), all the items that have been included in the preceding
cluster(s) are removed from the pool. However, an option
provided by the program permits the removal of only those
items which initiated the preceding cluster(s). It should be recalled
that a cluster always begins with three items.

The output includes a correlation matrix of all the items with
each cluster. Here again, a choice is provided for the use of point
biserial or biserial correlations. The user specifies his option.


Input

Three cards must precede the first data card.


1. Title card: contains an alphanumeric description of the job
being run.

2. Parameter card: specifies the characteristics of the data and
the options selected.

3. Format card: specifies the format of the input data.


The data cards consist of student responses to a set of test
items. Thus, a column might contain a 1 if a student answered
“true” and a 0 if he answered “false” to a given question.


Output

1. Item covariance matrix.

2. Step-by-step flow of cluster formation (an example follows).

3. Cluster covariance matrix.

4. Cluster correlation matrix.

5. Cluster standard deviations.

6. Point biserial or biserial correlation matrix (items with
clusters).

7. Matrix containing sums of covariances of an item with all
items in a cluster (equal to the numerator of the item-cluster
correlation).
Example illustrating step-by-step flow of cluster formation.


CLUSTER NUMBER 1
PRIME CLUSTER VARIABLES 1 3 10
3 ITEMS SUM OF COVARIANCES, C= 2.32987 SUM OF VARIANCES, V= 5.56507
MAXIMIZING RATIO, W=C/V = 0.41866 KR-20 = 0.68359
VARIABLES DELETED FROM CONSIDERATION = 5 6 8 11 13
ADDING VARIABLE 12
4 ITEMS SUM OF COVARIANCES, C= 4.21744 SUM OF VARIANCES, V= 7.57801
MAXIMIZING RATIO, W=C/V = 0.55654 KR-20 = 0.70234
ADDING VARIABLE 7
5 ITEMS SUM OF COVARIANCES, C= 6.85869 SUM OF VARIANCES, V= 10.59132
MAXIMIZING RATIO, W=C/V = 0.64758 KR-20 = 0.70537
END THIS CLUSTER
KUDER-RICHARDSON 20 OF THIS CLUSTER= 0.705374 VARIANCE= 24.308685
STANDARD DEVIATION= 4.930384
VARIABLES IN THIS CLUSTER ARE
1 3 7 10 12
VARIABLES REMAINING ARE
2 4 5 6 8 9 11 13


CLUSTER NUMBER 2
PRIME CLUSTER VARIABLES 13 2 6
3 ITEMS SUM OF COVARIANCES, C= 2.88350 SUM OF VARIANCES, V= 8.53475
MAXIMIZING RATIO, W=C/V = 0.33785 KR-20 = 0.60486
VARIABLES DELETED FROM CONSIDERATION = 8
ADDING VARIABLE 9
4 ITEMS SUM OF COVARIANCES, C= 4.73112 SUM OF VARIANCES, V= 10.32284
MAXIMIZING RATIO, W=C/V = 0.45832 KR-20 = 0.63767
VARIABLES DELETED FROM CONSIDERATION = 4
ADDING VARIABLE 5
5 ITEMS SUM OF COVARIANCES, C= 7.43266 SUM OF VARIANCES, V= 13.69672
MAXIMIZING RATIO, W=C/V = 0.54266 KR-20 = 0.65057
END THIS CLUSTER
KUDER-RICHARDSON 20 OF THIS CLUSTER= 0.650572 VARIANCE= 28.562032
STANDARD DEVIATION= 5.344346
VARIABLES IN THIS CLUSTER ARE
2 5 6 9 13
VARIABLES REMAINING ARE
4 8 11

Additional information about the program and its availability 
may be obtained by writing to Dr. Ram K. Gupta, Department 
of Educational Psychology, University of Alberta, Edmonton, 
Alberta, Canada. 
THREE PROGRAMS FOR THE COMPARISON OF
RESPONSE PATTERNS TO SELECT HOMOGENEOUS
GROUPS AND TO DETERMINE THE DEGREE OF
HOMOGENEITY¹


ELISABETH H. JOSLIN AND ALAN L. CARSRUD
University of New Hampshire


This series of programs developed on an IBM 360/50 is used
to group subjects according to their responses on a Likert-scale
questionnaire. The programs, which are flexible, allow the user to 
determine the format of responses, the amount of similarity re- 
quired for grouping, the minimal group size, and the items used 
for selection. The programs were initially written to test the
attraction-similarity paradigm of Byrne and Clore (1967).

Written in FORTRAN IV, the programs handle a maximum 
of 400 subjects and 40 items per subject and require approxi- 
mately 130K bytes of storage. 

Program 1—Program 1 collapses the input data into a response 
by item matrix over subjects. This format produces a response 
distribution for each item which is then used to select appropriate 
items for further analysis. 

Program 2—Program 2 may be used to produce certain specified 
response distributions by collapsing responses. It reformats the
data by combining responses and produces a transformed data 
deck. Program 1 may then be rerun to produce response distri- 
butions for the transformed data. 

For regrouping responses by methods other than simple collapsing,
it is suggested that the reader use the “Index Construction
and Recoding Program” in the OSIRIS II package (1971).

¹ The authors wish to thank Miss Inga Regnall for her help in gathering
pilot data for use in program development. Appreciation is also extended to
Drs. Ronald E. Shor and Leslie A. Fox for critically reviewing an earlier
draft of the paper.

Program 3—Program 3 selects homogeneous groups based on similarity
of responses to a critical test subject. Amount of similarity
is defined as the number of response items of each individual
subject which are identical to the corresponding response items of 
the critical test case. Critical test case subjects may be separate 
from the subject pool—as in the case of matching students to 
teachers. Otherwise every subject in the data pool is used as a pos- 
sible critical test subject. 

Matching may be done either with or without replacement; 
that is, subjects may be removed from or returned to the data 
pool after being used as a critical test case. 
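
The matching rule of Program 3 might be sketched as follows. This is only an illustration under stated assumptions: the function names, the treatment of every subject in turn as a critical test case, the counting of the critical case toward the minimal group size, and the removal of matched subjects when matching without replacement are guesses at details the description leaves open.

```python
def similarity(a, b):
    """Number of items on which two response patterns are identical."""
    return sum(1 for x, y in zip(a, b) if x == y)


def match_groups(responses, min_similarity, min_group_size, with_replacement=False):
    """Return {critical subject: [matched subjects]} for every successful match."""
    pool = dict(responses)
    groups = {}
    for critical_id in list(responses):
        if critical_id not in pool:
            continue  # already removed from the pool by an earlier match
        critical = responses[critical_id]
        matched = [
            sid for sid, pattern in pool.items()
            if sid != critical_id and similarity(critical, pattern) >= min_similarity
        ]
        if len(matched) + 1 >= min_group_size:
            groups[critical_id] = matched
            if not with_replacement:
                for sid in [critical_id] + matched:
                    pool.pop(sid)
    return groups


responses = {
    "s1": [1, 2, 3, 4],
    "s2": [1, 2, 3, 5],
    "s3": [5, 4, 3, 2],
    "s4": [5, 4, 3, 1],
}
groups = match_groups(responses, min_similarity=3, min_group_size=2)
```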

Program 3 produces the following outputs: 


a. a listing of the critical test cases and subjects which were
successfully matched to them;

b. a matrix of similarity (percentage of agreement) between
critical test cases;

c. if matching with replacement, a listing of subjects used, the
number of times matched, and the groups into which the
subjects were placed;

d. a listing of subjects for which matching was not possible.


These programs can also be used to analyze non-Likert-scale 
data. The user must run Program 2 to convert his data to a
Likert-scale format. This procedure will allow the comparison of 
subjects in terms of such characteristics as IQ scores, objective 
test scores, or personality inventories such as the MMPI. 


Availability 
Further information and copies of the programs can be obtained
from the first author at the following address: College of
Liberal Arts, Dean’s Office, Murkland Hall, University of New
Hampshire, Durham, New Hampshire 03824.
A FORTRAN PROGRAM FOR THE CONVERSION OF
IBM 1230 OPTICAL SCORER PACKED CARDS1


T. B. ROGERS 
University of Calgary 


THE conversion of IBM 1230 Optical Scorer packed cards is
usually performed using a program written in a basic machine
language such as Assembler or Map. To the casual computer user
this often poses difficulties, as he is generally forced to recruit a
system-trained person to write this portion of the program for
him. The problems inherent in such a procedure, particularly with
reference to dictionary or array matching, often lead to the
emergence of “one pass” programs (i.e., Games, 1970) that tend
to commit the user to a specific strategy of data analysis. While
these “one pass” packages are extremely useful, there are often
times when a researcher would prefer to analyze his data in a
manner different from that which the existing package permits.

The present program has been developed to introduce a further 
degree of flexibility into data management. The program is cur- 
rently written in FORTRAN G (IBM 360, but quite adaptable) 
which permits easy integration into other FORTRAN routines. 
The packed cards which show two responses per column on each 
card are repunched into standard output format with one re- 
sponse per column. The program, which was initially written to 
process dichotomous data, can readily be modified to include any 
number of response alternatives. 

1The computer time and data used in the development of this program
were made possible by a grant to the author from the Canada Council. The
author’s mailing address is: Department of Psychology, University of Calgary,
Calgary 44, Alberta, Canada.

The basic strategy of this approach lies in deciphering,
empirically, the response code used by the optical scanner. In the
IBM 1230, the packing is done such that the responses to items
1 and 2 are coded in the first data column, the responses to items
3 and 4 in the second data column, etc. The program reads the
packed bi-punched data points in alphameric and compares these
to a list that is read in as a parameter card. A second list, which
tells the machine which two response alternatives each alphameric
character represents, is also read in. The program simply
substitutes the actual two numbers from this second list for the
alphameric packed character and punches these in a standard
integer format.

The code used on the author's System is shown in Table 1. 
This code was deciphered by submitting a test response sheet 
containing all possible doublets of response alternatives to the 
1230 Optical Scorer and by then interpreting the resulting card 
output to determine the alphameric equivalent of each response 
doublet. To the extent that 1230 Systems tend to vary, it is ad- 
visable for each user to check his own system’s code before using 
this approach. Once the code has been determined, it is a relatively
simple matter to program the translation from alphameric packed
output to useable integer output. 
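
The substitution step can be illustrated with a short sketch. It is written in Python rather than the FORTRAN of the program itself, and the symbol list and doublet list shown are invented for the example; they are not the Calgary code of Table 1, and each installation's code would have to be determined empirically as described above.

```python
def unpack_card(packed, symbols, doublets, n_items=None):
    """Expand packed columns into one response per column.

    packed   -- string of alphameric characters, one per packed column
    symbols  -- the alphameric symbol code (first parameter list)
    doublets -- (first, second) response pairs in the same order as symbols
    n_items  -- optional item count, used to drop a trailing filler response
    """
    table = dict(zip(symbols, doublets))
    responses = []
    for ch in packed:
        first, second = table[ch]      # look up the doublet for this column
        responses.extend((first, second))
    if n_items is not None:
        responses = responses[:n_items]
    return responses


def punch(responses):
    """Render responses one per column, as on the repunched card."""
    return "".join(str(r) for r in responses)


# Hypothetical code list: 0 stands for an omitted response.
SYMBOLS = "ABCD"
DOUBLETS = [(0, 0), (1, 1), (1, 2), (2, 1)]

card = punch(unpack_card("BCD", SYMBOLS, DOUBLETS, n_items=5))
```

Because the two parameter lists are kept in the same order, extending the sketch to more than two response alternatives only requires longer lists, which parallels the modification the author mentions for non-dichotomous data.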

The program requires the following input: 


1. Parameter card showing (a) number of items in the test,
(b) number of packed cards per subject, (c) number of subjects,
(d) number of responses represented on a full card
(usually the number of data points per packed card multiplied
by two).

2. Alphameric symbol code (in one column fields) giving the
alphameric code for each response doublet.

3. Numeric response code giving the two numbers (response
codes) for each member of (2), in the same order as (2).


TABLE 1

Conversion Code for IBM 1230 Optical Scorer Packed Cards
for The University of Calgary

Columns: Alphameric Character; Response Doublet (First Response,
Second Response). [Table entries not recoverable from the source.]

*0 represents the output code for an omitted response.


This program, which is quite short, is adaptable for a sub- 
routine in a larger set of programs. It also can be used as a sep- 
arate conversion routine. A listing and source deck are obtainable 
from the author. 


REFERENCE 


Games, P. A. Flexible scoring and internal consistency item analysis 
on the IBM 360 series. EDUCATIONAL AND PSYCHOLOGICAL MEA- 
SUREMENT, 1970, 30, 141-142. 
A GATB CONVERSION PROGRAM


GARRETT MANDEVILLE AND ANTHONY DEMAIO
University of South Carolina 


OCCASIONALLY a researcher will find himself in the position of 
having data on punch cards in raw score form, whereas it is his 
intention to analyze standard (converted) scores; while the con- 
verse situation could also arise, it is less likely. 

A particularly troublesome example in which the writers were
recently involved was a situation in which the 12 subtest scores
for the General Aptitude Test Battery (GATB), Form B, were
available, but the nine aptitude scores were to be analyzed. The
difficult nature of this particular problem is due to the fact that
some of the aptitude scores are obtained from more than one
subtest (raw) score, which results in 15 (converted) aptitude
subscores. The 15 subscores are then reduced to nine aptitude scores
by obtaining the appropriate sums. Having gone through this
exercise in programming which, although quite simple, might
cause the novice some problems, the writers wished to mention the
availability of this program. An advantage which this program has
in comparison to most conversion programs is that the conversion
tables for the form used (A or B for IBM answer sheets) are
generated within the program, so that the inconvenience of having
this information on data cards and the possibility of these cards
being improperly ordered are eliminated.
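
The two-stage logic (raw subtest score to converted subscore by table lookup, then subscores summed into aptitude scores) can be sketched in miniature. The Python below is a toy illustration only: it uses three subtests and three aptitudes rather than the GATB's 12 and nine, and the table values, the `make_table` helper, and the subtest-to-aptitude assignments are invented, not taken from the GATB manual or from the authors' program.

```python
def make_table(max_raw, slope, intercept):
    """Generate a conversion table in the program itself.

    Entry i is the converted score for raw score i. A linear rule is
    used purely for illustration; real tables would come from the
    test manual.
    """
    return [intercept + slope * raw for raw in range(max_raw + 1)]


# Each key is (aptitude name, index of the contributing subtest).
# An aptitude fed by two subtests therefore has two subscores.
TABLES = {
    ("G", 0): make_table(10, 2, 20),
    ("G", 1): make_table(10, 3, 10),
    ("V", 1): make_table(10, 5, 50),
    ("N", 2): make_table(10, 4, 30),
}


def subscores(raw):
    """Converted subscore for every (aptitude, subtest) pair."""
    return {key: table[raw[key[1]]] for key, table in TABLES.items()}


def aptitudes(raw):
    """Sum the subscores belonging to each aptitude."""
    totals = {}
    for (name, _), value in subscores(raw).items():
        totals[name] = totals.get(name, 0) + value
    return totals


raw_scores = [3, 5, 2]          # one subject's raw subtest scores
result = aptitudes(raw_scores)
```

Building the tables inside the program, as here, is what removes the need for table cards in the input deck.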


Input 


A single control card which specifies which form (A or B) is
involved, whether the variable format option is being taken, and
what the various output options are heads each set of cards to be
processed. If the variable format option is taken, this card is next
and the first field is assumed to be an integer field which contains
the subject identification (I.D.). The data are stored in a floating
point mode. Blank cards sufficient to be read using this format
signal the end of the deck. The program automatically recycles
until it finds an extra blank card.


Output 


Printed output includes the subject I.D. number and each of
the 15 individual subscores or the nine aptitude scores in
columnar form appropriately titled. The latter output would
generally be desired.

The other output options include (a) printout of raw scores
for each subject, printed in natural order beneath the converted
scores; (b) punch card output of the 15 subscores or nine
aptitude scores (depending on the selection above); (c) printout of
the means, variances (biased estimates), and standard deviations
for the desired converted scores; (d) printout of the
intercorrelation matrix for the desired scores.

Instructions for punching the control card for these options
are printed in the comment section heading the program.


Time Estimates and Availability 
A running time of about 212 minutes was required on the IBM
7040 for a sample of 500 cases when all output options were
used. Compiling time was 152 seconds. A FORTRAN IV program
deck, an example data deck, and printout are available from
the authors at no charge.


Limitations 


There is no limitation as to the number of subjects which can
be processed and the storage requirements are minimal.
A COMPUTER EXAMINATION COMPOSITOR
FOR THE IBM 360/40 


WILLARD A. BROWN 
Western Washington State College 


AT least since 1964 various workers have used punched cards and
IBM 407 or equivalent machines to produce examinations on dupli- 
cation masters including among others Brown (1967) and Stodola 
(1965) in separate efforts. This procedure was useful in that it allowed 
the elimination of typographical errors, speeded up test production, 
encouraged the production of more and better test items, and al- 
lowed experimentation with the basic idea of machine-produced ex- 
aminations. It had the drawbacks, however, of restricting the test 
item constructor to upper case characters and to a very limited set
of punctuation characters. These restrictions were in part imposed by 
hardware and in part by the nature of operating systems available. 
The biggest drawback of all, though, was the gradual degradation 
of the test item file by improper replacement of cards in the file. 

With Western Washington State College Computer Center’s de- 
velopment of Western Washington State College Terminal Access 
Method (WTAM) and the acquisition of an IBM 360/40, the nec- 
essary environment was established for a more sophisticated exami- 
nation production system. WTAM allows the integration of (a) 
communications terminals (with upper and lower case characters), 
(b) the PL/I programming language, and (c) high speed large scale 
disk storage. The programmer may build files (programs, Job Con- 
trol Language ‘JCL’, and Text, or concatenated combinations) from 
remote terminals. Through use of the above environment, the Ex- 
amination Compositor (BR4EXAMS) was developed. The GSS sys- 
tem (AGT Management Systems, Inc.) has been substituted for 
WTAM from which GSS was partially derived. 
Program Objectives


The program BR4EXAMS performs the following services: 


1. Stores examination questions of the objective type on magnetic
disk, eliminates non-essential blank spaces, and performs blocking, 
independent of question length, in such a way as to avoid wasting 
storage space. 

2. Alters the file to add new questions, to replace old questions that 
need modification, or to change the recorded expected answer. 

3. Produces a catalog of the stored questions on either the high 
speed printer or the remote terminal. Using the high speed printer is 
the more economical method; however, all characters are translated 
to upper case. 

4. Produces page files for examinations to be printed on a communi- 
cations terminal using both upper and lower case characters. This 
feature may be used for the production of mimeograph or ditto mas- 
ters or for direct use by a student. Questions do not span two pages. 
In addition, an instructor's copy with associated answers is produced
in upper case alone on the standard printer.

5. Produces a separate page file of all the expected answers in a 
format corresponding with the standard IBM answer sheet. 

6. Provides for examinations using matching questions, multiple 
choice questions, true-false questions, essay questions, tables, or 
blocks of test items with a common text or any combination of 
these. 

7. Allows absolute formatting of tables and textual materials any- 
where in the examination. The symbol 7in the first column of a line 
or the expression # ¥ on a line preceding a table or text will override 
the program’s formatting features. When the absolute format feature 
is used, the content of the first column does not appear in test print- 
outs. With this one exception, there are no limits on the use of 
special symbols in questions. 

8. Determines line indentation, the number of whole words that 
can be accommodated on a line, and the number of questions that 
can be accommodated on a page. The default parameters for this 
service may be easily overridden if the user desires, through infor- 
mation supplied to the program on the “function” card. The typist 
who enters the questions into storage may ignore format. 
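
The formatting service of item 8, together with the rule in item 4 that questions do not span two pages, can be sketched as a greedy fill. This Python sketch is an illustration under assumed rules, not the PL/I logic of BR4EXAMS: the function names and the `width`, `indent`, and `page_length` parameters are hypothetical stand-ins for whatever the "function" card actually controls.

```python
def fill_lines(text, width, indent=0):
    """Fit as many whole words as possible on each line.

    Splitting on whitespace also discards non-essential blanks, so the
    typist's original spacing is ignored.
    """
    room = width - indent
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= room or not current:
            current = candidate            # word fits (or stands alone)
        else:
            lines.append(" " * indent + current)
            current = word                 # start a new line
    if current:
        lines.append(" " * indent + current)
    return lines


def paginate(questions, width, page_length, indent=0):
    """Place whole questions on pages; a question never spans two pages."""
    pages, page = [], []
    for question in questions:
        block = fill_lines(question, width, indent)
        if page and len(page) + len(block) > page_length:
            pages.append(page)             # question will not fit: new page
            page = []
        page.extend(block)
    if page:
        pages.append(page)
    return pages


stem = "The  name of the   game is"
lines = fill_lines(stem, width=12)
pages = paginate([stem, "a b c"], width=12, page_length=2)
```

A question longer than a full page would still overflow in this sketch; how the original program handles that case is not stated in the text.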
Files

The files used are sequential, generation files on disk. The use of 
generation files greatly increases the ease of modifying the content 
of the question file. When a new generation is produced, an older 
generation is deleted. One or more generations may be retained, but 
single generation storage would be the least expensive; card back-up 
in this case is a necessity. 

A program for the IBM 360/20 is available that allows lower 
case characters to be interpreted on the card in upper case for ease 
in editing the card file. For economical use of disk space, storage is 
continuous, and questions lap over from one record to another with- 
out regard to boundaries. The IBM2314 disk system is used; each 
disk pack has a capacity of 80,000 or so questions. 


Terminals 
The terminals are IBM2741 of the selectric ball type. This allows
the use of many type-fonts including foreign language special
characters and all the characters available on a standard typewriter.


PL/I Language 


The program is written in the PL/I language. This language, 
which allows some degree of machine independence, has features 
necessary to the problem that are not available in other compiler 
languages such as FORTRAN. 


Typical Operation 
Loading a new question into an existing file. 


//STEP1 EXEC PGM=BR4EXAMS,PARM=('SOURCE="TERMINAL" NN=7000')
    ...
//PAGE01 DD DSN=F0002.SOURCE(PAGE01),DISP=OLD
    ...
//PAGE10 DD DSN=F0002.SOURCE(PAGE10),DISP=OLD
//SYSIN DD *
FUNCTION='OLD FILE';
#24
The name of the game is
(1)Checkers.
(2)Computers.
(3)Chess.
(4)Cheese.
/*


The above combination of JCL and text may be stored under a name such as EXAM and placed 
in the job queue by the command RUN EXAM. 
Permissible Requests


A. Initiating a new file:                FUNCTION='NEW FILE';
B. Adding to old file:                   FUNCTION='OLD FILE';
C. Correcting an existing question:      FUNCTION='REWRITE';
D. Correcting existing answers:          FUNCTION='ANSWERS';
E. Requesting a catalog on the terminal  FUNCTION='CATALOG',DEVICE='TERMINAL',
   starting with the 99th question:      ORIGIN=99,QUESTIONS=100;
F. Requesting a catalog on the fast      FUNCTION='CATALOG';
   printer:
G. Requesting an examination:            FUNCTION='TEST';
                                         '151,152,153,154',2,10,3,'15,20,21'

(the quotation marks delineate questions that have
common stem text or have other characteristics
in common)


Storage Format 


Special symbols are generated and used to delineate questions 
and their parts in storage. These symbols, which appear as blanks 
on the printer, are eliminated in the terminal files. In the following, 
the symbols & and Ж will indicate such symbols. Questions are stored
in the following format.

&502020340015&The name of the game is Ж(1) Checkers. Ж(2) Computers. (3) Chess. (4) Cheese. &4022304& etc.

The first 14 characters are reserved for item performance data, the 
answer and the catalog number. In this instance 50 is the ease, 20 
is the validity, 2 is the answer. The 03 indicates the number of times 
the question has been used, and the 0015 is the catalog number. 
The inclusion of performance data has been left for future imple- 
mentation. 


Future Implementation 


Brown (1967) reported an assembler language examination anal- 
ysis program for the IBM 1620 computer. Since that date the ideas 
of the program have been reprogrammed in PL/I language and its 
capacity and sophistication considerably extended. BR4EXAMS 
and the analysis program will be tied together by over-lay program- 
ming so as to preserve performance data with each of the stored 
questions. 
A COMPUTER PROCEDURE FOR THE ITEM ANALYSIS
OF A MULTIPLE CHOICE TEST 


YOUNG B. LEE, WILLIAM B. MICHAEL AND ROBERT A. SMITH
University of Southern California. 


Purpose. It is the writers’ intention to provide a computer proce- 
dure to be used for the item analysis of multiple choice items. The 
program is written in FORTRAN IV language for processing by an 
IBM 360 computer. 

Procedures. Three control cards are needed to execute the program:


(1) Problem card: number of total items in the test and number of
alternative responses for the item, to be punched in 2I3 format.
Comments may be punched in columns 7 through 78.

(2) Format card: F-type format card, which describes the input format
(Example: 10F1.0).

(3) Key card: correct answers for each item, punched so as to
conform to the format card.

The data follow these three control cards.


Output. Output of the program provides item difficulty, standard 
deviation, reliability index of each item, point biserial correlation 
between the item and total test score, as well as the percentages of 
alternative responses made on each item. 

As a supplementary section, the following data are included in the 
computer output: (1) the actual list of input data (item-score ma- 
trix), (2) individual raw test scores, (3) individual’s score corrected 
for so-called guessing, (4) number of items each individual omitted, 
and (5) rank ordered list of individual scores. 
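
The statistics named above can be made concrete with a small sketch. It is written in Python, not the FORTRAN IV of the program, and it assumes conventional definitions that the note itself does not spell out: item difficulty as proportion correct, the reliability index as the item-total point biserial multiplied by the item standard deviation, and the guessing correction as rights minus wrongs divided by one less than the number of alternatives. The coding of an omitted response as 0 is also an assumption.

```python
import math


def item_analysis(responses, key, n_alternatives):
    """Item statistics for a multiple choice test.

    responses -- one list of chosen alternatives per subject (0 = omitted)
    key       -- correct alternative for each item
    """
    n = len(responses)
    scored = [[int(r == k) for r, k in zip(row, key)] for row in responses]
    totals = [sum(row) for row in scored]
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)

    items = []
    for j in range(len(key)):
        col = [row[j] for row in scored]
        p = sum(col) / n                       # item difficulty
        sd = math.sqrt(p * (1 - p))            # item standard deviation
        if 0 < p < 1 and sd_t > 0:
            mean_pass = sum(t for t, c in zip(totals, col) if c) / sum(col)
            r_pb = (mean_pass - mean_t) / sd_t * math.sqrt(p / (1 - p))
        else:
            r_pb = float("nan")                # undefined without variance
        percent = {alt: 100.0 * sum(row[j] == alt for row in responses) / n
                   for alt in range(1, n_alternatives + 1)}
        items.append({"difficulty": p, "sd": sd, "point_biserial": r_pb,
                      "reliability_index": r_pb * sd,
                      "percent_chosen": percent})
    return items, totals


def corrected_score(row, key, n_alternatives):
    """Rights minus wrongs/(k - 1); omitted items count as neither."""
    right = sum(r == k for r, k in zip(row, key))
    wrong = sum(r != k and r != 0 for r, k in zip(row, key))
    return right - wrong / (n_alternatives - 1)


key = [1, 2]
data = [[1, 2], [1, 3], [2, 2], [0, 1]]
items, totals = item_analysis(data, key, n_alternatives=4)
ranked = sorted(totals, reverse=True)          # rank ordered raw scores
```

Note that the item is included in the total score against which it is correlated; whether the original program applies any correction for this overlap is not stated.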

Availability. The program is available from the first mentioned
author (at the Department of Educational Psychology, University
of Southern California, Los Angeles, California 90007).




BOOK REVIEWS


MAX D. ENGELHART, Editor
Duke University


DENNIS M. ROBERTS, Assistant Editor
East Carolina University


Bailey’s Probability and Statistics: Models for Research. JAMES A. WALSH
Bishir and Drewes’ Mathematics in the Behavioral and Social Sciences. PETER A. TAYLOR
Dayton’s Design of Educational Experiments. PETER A. TAYLOR
Guilford and Hoepfner’s The Analysis of Intelligence. A. RALPH HAKSTIAN
Knapp’s Statistics for Educational Measurement. LEWIS R. AIKEN, JR.
Lehmann and Mehrens’ Educational Research: Readings in Focus. GERALD M. GILLMORE
Stufflebeam and others’ Educational Evaluation and Decision Making. CARL A. CLARK
Summers’ Attitude Measurement. S. B. KHAN
Tatsuoka’s Multivariate Analysis: Techniques for Educational and Psychological Research. ROBERT K. GABLE AND ROBERT M. PROZER
Tryon and Bailey’s Cluster Analysis. NORMAN CLIFF
Tyler’s Tests and Measurements. GLENN W. DURFLINGER




Dr. Dennis M. Roberts of the Department of Psychology, East
Carolina University is replacing Dr. Henry Moughamian of the
City Colleges of Chicago as Assistant Review Editor. We are [...]
and psychological measurement, statistical methods applicable to
educational research, and on applications of computers.


Daniel E. Bailey. Probability and Statistics: Models for Research.
New York: John Wiley & Sons, 1971. Pp. xviii + 686. $12.95. 


Bailey's book is one of the most extreme examples to date of a
fast-accelerating trend in applied statistics texts. The tendency is 
primarily away from encyclopedias of methods and toward ap- 
proaches that integrate theory and methods. Bailey spends 16 
chapters and two-thirds of his pages developing ideas of probability
and the philosophy of statistical inference before introducing
actual applications. Because of this early investment in theory,
the final chapters on the one-sample problem, two-sample problem, 
many-sample problem, regression and correlation, and Pearson chi- 
square do not need to be tied together with the usual makeshift 
conceptualizations based on superficial aspects of the data. Instead 
they are linked naturally by means of the probabilistic concepts 
that form their common base. 

One of the most rewarding aspects of Bailey's text is the excel- 
lence of his treatment of probability. He begins at the beginning 
with ideas of sets and proceeds up through density functions and 
continuous variables with a minimum of question-begging and 
flim-flam. There are a few important topics such as the relationship
of marginal distributions to independence that are lacking, but
on the whole the quality of the treatment (if not the depth) com- 
pares favorably with that of many of the better mathematical 
statistics texts. The cost, of course, is in number of pages. Rather 
than the elegant condensation that a mathematical orientation 
makes possible, Bailey has laid everything out in detail for the
quantitatively unsophisticated reader. 

Occasionally the asides to the psychology graduate student get 
out of hand. It is unfortunate that the introductory chapter is the 
most conspicuous example of this. It is concerned with explaining 
the relationships among psychological constructs, measurement, 
and probability and statistics. It should have been brief, concise, 
and a model of clarity. Instead it is ponderous and opaque, due 
largely to Bailey’s turgid prose style. He simply does not have 
the gift of the apt phrase, although he can set out a complex chain 
of reasoning or explain a complicated diagram with the very best 
of present writers. 

Several other facets of the text deserve special mention. The
discussion of tests of significance with respect to utility and
competing hypotheses is well-balanced and excellent. The chapters on
sampling and on the one-sample problem also seem to be models of
precise exposition. On the other hand, the treatment of complex
factorials and repeated measures is probably too compact and
compressed for most students to take in without a great deal of
amplification in lecture. Similarly, the discussion of multiple
comparisons is probably too general in scope to be very useful without
a lot of teacher assistance.

A particularly valuable aspect of the book is the chapter ends:
at the end of each chapter after number one, the author provides a
list of suggested readings, a glossary, and a set of problems. The
problems are almost without exception pertinent and of an
appropriate level of difficulty. In addition, the glossary recapitulates
the key concepts of the chapter and their definitions. The readings
referenced at the ends of the chapters are for the most part useful
auxiliary materials. However, Bailey occasionally gets a bit
optimistic, as in Chapter 15 where he recommends T. W. Anderson’s
An Introduction to Multivariate Statistical Analysis. Few beginning
students will find Anderson's highly sophisticated book helpful.

In summary this must be rated an excellent text for advanced
undergraduates or beginning graduate students. Its integrated
approach provides a basis in theory for a clear understanding of
statistical methods. The applied procedures are for the most part
well and thoroughly laid out. The section on complex analysis of
variance is somewhat compressed and the treatment of regression is
unfortunately confined to the univariate case, but these are relatively
minor faults. On the whole, Bailey’s text is a superior one, and
it has a symbolic value as well. In light of current trends, it may
represent the last step from the point where psychology and social
science students will as a matter of course obtain a thorough-enough
grounding in mathematics to acquire their statistical theory
in mathematical statistics courses rather than from books like this.


JAMES A. WALSH 
Iowa State University 


John W. Bishir and Donald W. Drewes. Mathematics in the Behavioral
and Social Sciences. New York: Harcourt, Brace, 1970.
Pp. xiii + 714. $10.95.


This useful volume introduces the reader to those mathematical
methods judged by the authors to be most frequently used in the
social and the behavioral sciences, and could even find considerable
use among management scientists. The potential audience is
presumed to have some competence in mathematics, though an
intuitive understanding could be attained by most students who are
uninitiated in modern mathematical techniques. Additional
readings are suggested at the end of each chapter for those who wish
to pursue a specific topic. Because the book can be used in a
one-, two- or three-semester sequence, it appears plenty of time
could be allowed so that any deficiencies in mathematical
background could be remedied.

There are 21 chapters, each liberally supplied with problems.
These chapters are grouped into four parts: Part I is devoted to
Finite Mathematics (set theory, symbolic logic, relations
(equivalence, ordering and graphical), functions, sequences, infinite
series and combinatorial analysis); Part II is a brief outline of
matrices (arithmetic operations, equation-solving and characteristic
equations) and Part III is a cursory outline of the methods of calculus
(differentiation, differential equations, integration, power series
and functions of several variables, including partial differentiation).
Part IV is an introduction to probability theory with added
chapters on Markov Chains and Continuous Random Variables.

As the content might suggest, this is an impressive volume of 
mathematics to attempt in one volume. And for the most part, it 
appears that it would be necessary to supplement this text for 
the majority of education students and many psychology students. 
Yet, as a mathematician, the effort to provide these skills seems 
almost a sine qua non in our increasingly quantified sciences. 
Most graduate schools in behavioral sciences seem to rely on pre- 
vious or current contact with mathematics departments to incul- 
cate the needed skills, with one too-frequent outcome among the 
students being either a failure to pay attention because a par- 
ticular technique is not obviously relevant, or an esoteric and rigid 
appreciation of the theory without the correlative skills to apply it. 
The time has passed when it was necessary to plead for a
familiarity with, and competence in, the processes of linear algebra
and functional theory. So much of the very heart of our
bread-and-butter analytic tools rests upon a comprehension of matrix theory,
graphs, and continuous functions that to attempt to function
adequately without this comprehension is like daring to drive for long
periods of time without a driver's license. One can eventually
perform the routines, but there is the ever-present risk of getting
caught out. There is soon none of the risk-taking spirit left (for
fear of exposure) and so one eventually becomes a hemmed-in
pedestrian. Yet one has to ask where such a text as this would
find a place in most university curricula. Certainly, special effort
would have to be made to incorporate courses in behavioral science
mathematics, but the pay-off results in greater use of powerful
techniques, a willingness to experiment with new models and the
opening of a potential interface with many other disciplines.

This, then, is a book that would need careful thought as to its
location in an academic program and would probably be best
handled for instructional purposes by a pure mathematician. The
point is that there are too many critical concepts being conveyed 
at a rapid speed for misinformation to interfere with the learning 
process. In too few universities would this volume find a course 
ready for it, and it is not appropriate for use as a self-study aid. 
The book is unquestionably a perfect source to refurbish erstwhile 
skills, but as a reference it isn’t well assembled. A useful book, 
but one that will probably find difficulty in obtaining an audience. 


PETER A. TAYLOR 
Department of Regional 
Economic Expansion 
Ottawa, Ontario, Canada 


C. Mitchell Dayton. Design of Educational Experiments. New 
York: McGraw-Hill, 1970. Pp. vii + 441. $10.95. 


The “Design of Educational Experiments” has attempted to
describe the common principles and techniques of experimental
design within the specific context of educational research. The
presumption that “specific” means “special” has somewhat
watered down what are really universal principles in order to make
experimental situations approximate the exigencies of the educational
setting. The author claims to have “adapted his presentation” to
an educational audience, but seems to have underestimated the
[...] that are put on students in the more aggressive
institutions.

Nevertheless, the attempt has been a valiant one, and there are 
many interesting designs and analyses introduced, always clearly 
and always in an easily-read format. The author’s intent was that 
this text would serve a one-semester graduate-level course, but more 
progressive instructors could well employ it at an undergraduate 
level. The great advantage of the text is its simplicity of
presentation, which would make it quite appropriate at undergraduate
levels.

The text attempts a coverage of “virtually all experimental
designs which currently enjoy wide usage in education.” Because of
the intended time-frame of a one-semester course, some of the 
coverage is necessarily scant, though only a purist would insist on 
more rigor. For the mathematically naive, procedural sequences are 
summarized in the text, though the printers have left it rather 
difficult to discern the summaries by running them on in the major 
portions of the text. 

None of the content is particularly startling, though the
experienced reader might find some refreshing inclusions, such as
Page’s L test, response-surface methodology and a too-brief
reference to multivariate designs. The mere introduction of these
techniques will surely spur other authors to treat them with more
rigor and in greater detail and thus push back the periphery of
current pedagogic boundaries.

One might carp at the lack of precise mathematics that appears 
from time to time in order to maintain simplicity, but this oc- 
casional simplification must be presumed to be an effort to ac- 
commodate to some stereotypic education student. A great advant- 
age to the book accrues through its use of educational illustrations 
of all the techniques. 

This is a “quiet” book, containing nothing that is startling either 
in content or in presentation. However, it contains much that is 
attractive by way of content-organization, and would probably be 
quite suitable for most senior undergraduate and junior graduate 
courses, particularly for individuals whose skills are not quantita- 
tively strong. 


PETER A. TAYLOR 
Department of Regional 
Economic Expansion 
Ottawa, Ontario, Canada 


J. P. Guilford and Ralph Hoepfner. The Analysis of Intelligence. 
New York: McGraw-Hill, 1971. Pp. xiv + 514. $16.95. 


This book will be of some interest to all those concerned with 
the study of human abilities, and of considerable interest to that 
somewhat smaller group who were swept up by the senior author's 
1967 volume, The Nature of Human Intelligence. The present book, 
unlike the earlier one, in which a major theoretical position was 
enunciated and elaborated, reads like a long research paper, with 
the research reported firmly anchored in the senior author's
“Structure of Intellect” (SI) model. In addition to reporting the earlier
studies which preceded the formulation of the SI model, the authors
report results of research conducted since 1967, and, thus, the book
can be considered, in part, an updating of SI thinking, with 98 of 
the 120 cells in the three-facet design now filled, as opposed to 77 
in the 1967 treatment. Clearly, this book is a “must” for those 
serious students of human abilities who wish to keep abreast of 
Guilford’s thinking in the area. The major research studies on 
intellectual abilities (with which both authors have been associated) 
for the last 20 years in the Aptitudes Research Project at the 
University of Southern California are presented in an orderly 
fashion, discussed, and summarized. The book will prove to be of 
considerably less value, however, to those who either have not been 
persuaded by the SI organization of abilities or are just beginning 
the study of human intelligence. No theoretical refinements over
the 1967 statement are apparent and results based on experimental
procedures laden with subjectivity are frequent. 

The contents of the book fall into roughly three categories. In 
Chapters 1 through 4, some background material is presented. A 
very brief and inadequate treatment is given of earlier research 
on abilities, including that conducted by Spearman, Thurstone, 
and others, and this material is followed by a section describing 
the origins and activities of the Aptitudes Research Project. Next, 
the SI model is reviewed, and some implications of this organization 
of abilities are noted. Rounding out this first section on preliminary 
considerations are two chapters devoted to certain factor analytic 
methodological issues. 

It is likely that the methodologically competent reader will find 
these two chapters on analysis procedures somewhat distasteful. A 
section entitled “Indices of Factorial Invariance,” for example, con- 
tains a poor presentation of such indices, with several omissions 
and inaccuracies. A section on “Factor Extractions” reveals that, 
in general, more factors were retained in the subsequently reported 
analyses than would be by the vast majority of experienced factor 
analysts. Such criteria as the number of factors associated with 
latent roots of R - U² greater than .30, and the number of common
factors accounting for 95% of the total communality were
employed, helping to account for the unusual number of singlet
“common” factors in the analyses reported later. Undoubtedly
most distasteful of all, though, is the great dependence—in the more 
recent studies at least—on the Orthogonal Procrustes procedure 
(Schönemann, 1966) for rotation, a technique the authors refer
to as "targeting." With this procedure, unrotated factor loading
matrices are fitted, orthogonally and in a least-squares sense,
to a target matrix which presumably embodies the hypothesis to
be confirmed or disconfirmed. It has been pointed out by Horn
(1967, 1971), however, that, given a large ratio of retained factors
to variables, and a small number of marker variables per factor
(both conditions are certainly met in the studies reported in this 
book), virtually any set of data can be fitted to any hypothesis. 
Knowing this fact, the reader may well consider tenuous the conclusions
resulting from such subjective and potentially biased procedures.
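
To make the fitting step concrete, the following is a minimal sketch of an orthogonal least-squares fit to a target. As a simplifying assumption it is restricted to two factors and to proper rotations (no reflections), which keeps it free of any matrix library. The loadings, the target, and the function names are hypothetical illustrations; they are not taken from Schönemann (1966) or from the book under review.

```python
import math

def procrustes_angle_2d(loadings, target):
    """Angle (radians) of the rotation T = [[c, -s], [s, c]] that minimizes
    the sum of squared differences between loadings * T and target."""
    # Elements of the 2 x 2 cross-product matrix M = loadings' * target.
    m11 = sum(a[0] * b[0] for a, b in zip(loadings, target))
    m12 = sum(a[0] * b[1] for a, b in zip(loadings, target))
    m21 = sum(a[1] * b[0] for a, b in zip(loadings, target))
    m22 = sum(a[1] * b[1] for a, b in zip(loadings, target))
    # Minimizing the residual is equivalent to maximizing trace(T' M),
    # which equals cos(t) * (m11 + m22) + sin(t) * (m21 - m12).
    return math.atan2(m21 - m12, m11 + m22)

def rotate(loadings, theta):
    """Post-multiply each row [a1, a2] by T = [[c, -s], [s, c]]."""
    c, s = math.cos(theta), math.sin(theta)
    return [(a1 * c + a2 * s, -a1 * s + a2 * c) for a1, a2 in loadings]

def sum_squared_error(fitted, target):
    """Least-squares criterion: total squared discrepancy from the target."""
    return sum((f - t) ** 2
               for f_row, t_row in zip(fitted, target)
               for f, t in zip(f_row, t_row))

# Hypothetical unrotated loadings (4 variables x 2 factors).
unrotated = [(0.60, 0.40), (0.55, 0.45), (0.50, -0.50), (0.45, -0.55)]
# Hypothetical target: variables 1-2 mark factor I, variables 3-4 factor II.
target = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]

theta = procrustes_angle_2d(unrotated, target)
fitted = rotate(unrotated, theta)
error_before = sum_squared_error(unrotated, target)
error_after = sum_squared_error(fitted, target)
print(round(math.degrees(theta), 1), round(error_before, 3), round(error_after, 3))
```

With two factors the rotation has a single free parameter, so the fit here is tightly constrained. Horn's criticism concerns the opposite situation, in which many factors are retained relative to the number of variables and the rotation has far more freedom to accommodate whatever target is supplied.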

The substantive findings from some 31 original factor analyses,
conducted between 1950 and 1969 in the Aptitudes Research Project,
as well as from reanalyses of the original correlation matrices,
comprise the second and largest category of topics in the
book. In Chapter 5, studies of abilities in reasoning and problem
solving are presented. These abilities are seen as falling mainly
into the cognition and convergent production cells in the SI cube.
Chapter 6 deals with abilities in creative thinking and planning, 
and, as one would expect, the studies reported are used mainly to
provide evidence for the existence of divergent production abilities. 
Chapters 7 and 8, concerned with evaluation and memory abil- 
ities, respectively, round out the treatment of abilities in terms of 
the operation facet of the cube—all five hypothesized operations 
having been considered. Finally, Chapter 9 is devoted to an SI 
content category, behavioral abilities. It is in this area, largely 
identified with social intelligence, that we find most of the remain- 
ing unfilled cells—only 12 of the possible 30 abilities having been 
“demonstrated” at this time—and it is this area that has tradition- 
ally been excluded from the study of intelligence. The reader, thus, 
will likely find this chapter interesting; in addition, some clever 
tests to assess the constructs are described. 

A third section, in which the SI model is placed in the larger 
educational and psychological context, concludes the main body of 
the book. Chapter 10 contains results of analyses of correlations 
between various SI abilities and selected external criteria. Among 
the analyses reported is one in which divergent production abilities 
were related to both teacher ratings of creativity and IQ as mea- 
sured by the CTMM. Although Guilford has stressed that the 
divergent production abilities in the SI model represent the im- 
portant components of creativity and tend to be somewhat inde- 
pendent of IQ, the results presented, with seventh grade students, 
include higher correlations, on the average, between the divergent 
production abilities and IQ than between the former and teacher
ratings of creativity. Also reported is a predictive validity study 
involving 15 selected SI abilities and grades in 10 subjects at the 
U. S. Coast Guard Academy. Of the resulting 150 correlations, only 
31 were significantly different from 0 at the .05 level (one of the 
31 was negative), and of these 31, the median r was only .28. Not
one of the 15 SI abilities correlated significantly with grades in 
either algebra and plane geometry or communications. The reader 
must, therefore, be cautious in attributing much predictive efficacy 
to these logically derived SI constructs. Chapter 11 contains some 
general conclusions drawn from the research reported regarding 
factor analysis as a theory-building technique, the place of the 
SI model in general psychological theory, and implications of 
this model for intelligence testing.
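
The arithmetic behind the Coast Guard Academy figures can be laid out as a short sketch. The counts and the median r are those quoted above; the chance expectation assumes, purely for illustration, that the 150 tests were independent, which correlations computed on a single sample are not. The variable names are hypothetical.

```python
# Figures quoted in the review.
n_correlations = 15 * 10   # 15 SI abilities x 10 academic subjects
alpha = 0.05               # significance level used
n_significant = 31         # correlations reported as significant
median_r = 0.28            # median of the significant correlations

# Number of "significant" results expected if every true correlation were
# zero (assumes independent tests, an illustrative simplification).
expected_by_chance = n_correlations * alpha

# Share of the 150 correlations that reached significance.
proportion_significant = n_significant / n_correlations

# Proportion of grade variance shared with an ability at the median r.
shared_variance = median_r ** 2

print(expected_by_chance, round(proportion_significant, 3), round(shared_variance, 4))
```

On these figures the number of significant correlations exceeds the chance expectation, but a correlation at the median accounts for well under a tenth of the variance in grades, which is consistent with the caution urged above.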

Two appendices follow Chapter 11. The first is a list of the 
entire set of reports from the Aptitudes Research Project. Ap- 
pendix B, comprising 130 pages, contains a complete list of all 
tests employed in the research reported and a sample item from 
each. Those interested in detailed study of human abilities and 
particularly those also involved in test development will find this 
appendix of great value as a reference source. A list of works cited 
follows Appendix B, and this, in turn, is followed by name and 
subject indices.

Regarded in its entirety, this book leaves several negative
impressions. First, and least important, is the lifeless prose; most
will likely find the book, in general, monotonous reading. More
importantly, though, the content suffers too. It is difficult to
imagine those concerned with the nature of human intelligence being
satisfied, upon reflection, with the notion that intelligence is most
parsimoniously and efficaciously described in terms of 120
orthogonal dimensions. One suspects that the tendency, noted earlier,
towards overfactoring may well have contributed to a substantial
fragmentation of previously noted, broader, albeit unitary,
constructs, and the alert reader will surely speculate about whether
the 98 SI abilities currently “demonstrated” would prove empirically
to be even nearly mutually uncorrelated. Most serious,
though, are the previously noted hypothesis confirmation techniques
employed. Although the authors appear to believe that the research
reported in this book constitutes empirical support for the SI
position, the apparent support is illusory. The student of human
abilities may be expecting too much to demand, as many logical
empiricists would, findings robust enough to emerge from analyses
using a variety of techniques (see Harris, 1967, for a discussion
of varying techniques in a factor analytic context), but he has
the right to require that it be possible for hypotheses and theories
tested to be disconfirmed. The procrustean techniques employed in
the studies reported in this book all but preclude this possibility.
The SI model, thus, has gained little if any empirical status on the
strength of the reported research, and remains what it has always
been: merely a logical organization of the field.

In summary, The Analysis of Intelligence represents a comprehensive
and up-to-date discussion of SI abilities and the relationship
of other intellectual and non-intellectual traits to the SI model.
The book is, thus, narrow, offering, as it does, little view of
alternative thinking about abilities; Cattell's (1963) well-received
theory of fluid and crystallized intelligence, for example, is not
mentioned once. The strength of the book would seem to revolve
around its great detail and its comprehensive treatment of
assessing, at times ingeniously, performance on a large number of
extremely narrow and specific tasks. It is concluded, however, that
the book can hardly be considered to make a substantial contribution
to either the understanding of human intelligence or the
techniques useful for its analysis.


Knapp. Statistics for Educational Measurement.
Scranton, Pa.: International Textbook Company, 1971. Pp. ix +
88. $1.95 (paperback).


This little book, designed for students in educational and psy- 
chological measurement courses who have not had statistics, is 
packed with information and procedures of a statistical nature. 
Therein lie both its strengths and weaknesses. It is strong because 
it is generally well-written and concise, but fairly comprehensive. 
It is weak because it moves too quickly for its intended audience, 
omitting certain topics (viz., general Spearman-Brown formula, 
factor analysis, computation of percentiles, percentile ranks, T
scores) that are important to an understanding of measurement 
theory. In addition, it includes certain topics of relatively minor 
importance (relationship between standard deviation and range, 
the Goodman-Kruskal gamma statistic) and dwells too long on 
esoteric topics such as the meaning of probability, while giving 
little attention to item analysis, construct validity and scales of 
measurement. There is only a footnote reference to generalizability
theory, and no mention of the important contemporary topic of 
domain sampling. 

The book consists of eight short chapters (1. The Measurement 
Problem, 2. Probability, 3. Basic Statistical Concepts, 4. Objec- 
tivity, 5. Reliability, 6. Validity, 7. Item Analysis, 8. Sampling 
and Tests of Significance), plus an appendix of statistical rules 
and derivations. It is more of an educational than a psychological 
measurement text, and focuses on concepts and formulas rather than 
examples and computations. The author's style (first person
singular!) is engaging, but in several spots I felt short-changed by his
curtailed discussion. Too often it seemed that we were just warming
to a topic, only to stop abruptly and go on to another.

The more serious question concerning this book is where it can 
be used. My own bias is that the statistics required for
understanding a basic measurement text should be incorporated into
the text itself or at least an appendix. But being a realist, I 
observe that certain popular authors in this area do not agree with 
me. Therefore, Knapp’s book could serve as a useful supplement in 
testing courses where some knowledge of statistics not included in 
the main textbook itself is assumed. Having recognized a need for 
a book such as this, the further question arises as to whether 
assigning this particular book would be as effective as requiring 
the student to consult a more thorough discussion in an elementary 
statistics book. That question, dear reader, you will have to answer 
for yourself after comparing Statistics for Educational Measure- 
ment with other options available to you. 


LEWIS R. AIKEN, JR.
Guilford College 


Irvin J. Lehmann and William A. Mehrens (Eds.). Educational
Research: Readings in Focus. New York: Holt, Rinehart, and
Winston, 1971. Pp. xiii + 460. $6.95 (paperback).


In this reviewer's experience, there are three prevalent and recurring
shortcomings in the plethora of books of readings in education
and the social sciences. First, the articles are typically
classified into several sections, each of which is preceded by an 
editorial discussion of the ensuing contents. Unfortunately, these 
introductions are usually composed of summaries of the articles, 
one after another, with little or no integration. They should rather 
provide the potential reader (usually a student) with a structure 
or framework which he can use to guide his reading of the articles 
which follow. Second, the articles themselves are of varying levels of 
difficulty, typically with more in the “too difficult for students” 
category than any other. Finally, seldom are the articles followed 
by a device to enhance the achievement of the book’s stated purpose, 
such as study questions or highlights worthy of thought or discus- 
sion. Editorial comments of some sort should be used to tie the 
articles back into the framework provided by the introductions. 

The result of books displaying these common failings is a col- 
lection of articles rather than a unified pedagogical device. In terms 
of teaching, the only essential advantage the type of book des- 
cribed above affords an instructor is that he does not have to send 
students to a library to fight over the subset of articles in the 
book which he feels are appropriate for his class. He simply can 
ask them to buy the book. 

Unlike “run-of-the-mill” books open to the criticisms delineated 
above, Educational Research: Readings in Focus, edited by Lehmann
and Mehrens, incorporates features which make it impressive
on all three counts. First, the introduction to the book as a
whole, and to each of the five chapters, is lengthy and goes beyond
merely summarizing the subsequent articles. (Lehmann and Mehrens 
have left the summarizing to the abstracts by the contributors.) 
The integration these introductions provide is in terms of research 
methods rather than the specific content of the articles—which, of 
course, is the raison d’être of the book.

The articles themselves are generally well-chosen. In defense of 
those who have been less successful, selecting research articles to 
illustrate a method is undoubtedly an easier task than selecting 
articles to illuminate a substantive area. However, the editors ap- 
parently went to some pains in their selection. Specifically, they 
stated that to be included, an article should be of intrinsic in- 
terest to educators, reasonably recently published, understandable 
to beginning researchers, above average in quality, and diverse in 
content (p. ix). On the whole, they achieved these goals, although 
more emphasis could have been placed on the fact that under- 
standing of most of the articles has as a prerequisite a knowledge of 
basic statistics and a few of the articles require an extensive 
background (including analysis of covariance and principal com- 
ponents analysis with orthogonal rotations). 

Each article is followed by a list of questions and occasionally 
a statement by the editors pointing out when an article as a whole 
or in some aspects was felt to be particularly good. The questions 
focus the reader on important aspects of the study. They also serve 
as an integrating force, referring back to points made in the pre- 
ceding introduction. The questions and statements could also 
easily serve as a take-off point for in-class discussions.

The introductory chapter in the book is divided into two sections: 
the nature and purposes of educational research and the evalua- 
tion of educational research. The latter section is particularly good. 
Each aspect of the research process, e.g., the problem, hypotheses,
etc., is accompanied by a list of questions which the consumer
of research should be asking as he reads. While the editors
do note that not all questions they pose are relevant for all
research, most are. These lists of questions would seem to provide
a beginning, or a not so beginning, student of educational research 
some cognitive structure to use in mastering a complex area. 

Articles within the book are placed into five chapters based on 
the type of research represented: historical, descriptive, correlational, 
causal-comparative, and experimental. Preceding each chapter is an
introduction in which the particular type of research is character- 
ized, and some commonly associated problems are discussed. A 
list of questions, which are similar in kind to those found in the 
introductory chapter, but specific to the type of research under 
discussion, is also provided.

Articles were chosen which display controversies in three
instances. In one case, an article by one researcher is followed by
a critique by another. In another case, an article is followed by a
critique which is, in turn, followed by a rejoinder by the author
of the original article. Finally, the existence of the “Experimenter
Bias Effect” (Rosenthal, 1966) is the subject of debate in four
articles.

The book is terminated with an epilogue which contains a brief
introduction “setting-up” the reader and a short article by Sipay
(1965). In the article, data are presented which support the notion
that prenatal training improves later reading ability, and that one
prenatal training method is superior to another. The reader who
does not “get the point” is referred back to the introductory chapter
where Sipay’s article is exposed as a complete fabrication with
invented data. The purpose of the article was presumably to
illustrate the gullibility of educators.


While there is a place for clever satire, this reviewer wonders if
a point is being made which should not be made. Questioning a
researcher's logic, his analysis, or his conclusions is a legitimate
and important enterprise. However, within empirical science, an
article of faith must exist which states “thou shalt believe the
experimenter collected his data the way he says he did and that
the data he presented are the data he collected, unless you have
pretty strong evidence to the contrary.” The consumer of research
should feel a basic trust in the honesty of the researcher.
Furthermore, he cannot validly disbelieve actual reported data at his
discretion. Otherwise, a consumer could choose to believe data which
were congruent with his biases and disbelieve data which were not,
and thus nullify the great advantage of the scientific method.

In the preface, Lehmann and Mehrens stated that “. . . this book
is more likely to serve as a supplement to one of the standard
introductory research texts” (p. vii). If they are referring to a
course for undergraduate or first-year graduate students, this
reviewer would concur. However, the book would seem to be an ideal
text for an advanced educational research course, with knowledge of
basic statistics a prerequisite and some research experience
preferable. It would not be suitable in any context for students with
no statistical background, for the editors make no attempt to teach
statistics but wisely concentrate on teaching educational research.

For an overall evaluation, Educational Research: Readings in
Focus comes close to being exemplary. Its greatest strength does
not lie in the individual articles, but in the unitary nature of the
book. Indeed, Lehmann and Mehrens have shown that a book of
readings can be more than a collection of articles.


Daniel L. Stufflebeam and others. “Educational Evaluation and
Decision Making.” Phi Delta Kappa National Study Committee
on Evaluation. Itasca, Illinois: F. E. Peacock Publishers,
Pp. xxviii + 368. $7.50.


“Evaluation is seized with a great illness,” the authors say; and 
after describing the illness, they present what they believe is the 
cure. First, there are the symptoms: the “avoidance symptom,” 
shying away from evaluation; and there is the “anxiety symptom,” 
the “skepticism symptom,” the “lack of guideline symptom,”
the “missing elements symptom,” and the “no-significant-difference
symptom.” This last symptom is the surprise in the list. If an
evaluation finds non-significant differences in a study, the authors
believe that the methodology should be questioned. To quote, “When 
a technique continually produces findings that are at variance with 
experience and common observation, it is time to call that tech- 
nique into question.” If such is the case, perhaps the avoidance 
symptom is really a good thing. Why bother with an elaborate
evaluation if you will discard the results when they differ from
“common observation”?

To state, as the authors do on a preliminary page, that the pur- 
pose of evaluation is to improve, not to prove, does not mean, or 
should not mean, that whatever you try is good and the only ob- 
jective is to make it better. And even when it does, the question 
as to whether or not it is being made better is always crucial. To 
answer such questions, scientific methodology, including tests of
significance, has been developed because it has been found that 
experience and common observation are so often wrong. 

But what the authors are really concerned with is not evaluation
as the term is ordinarily understood. What they are really
concerned with is a theory of educational programming—how to 
set up a system of steps or a project to produce educational change. 
This can be seen in their re-definition of evaluation:

“Educational evaluation is the process of delineating, obtaining, 
and providing useful information for judging decision alternatives.” 

Evaluation in the sense of measurement and of judging the value
of what has been done is thus relegated to a secondary function:
secondary, though contributory, to the shaping of a given educa- 
tional process. They present a “theory” of educational planning and 
implementation as involved in educational decision making, and in 
this respect the book has a great deal to offer, as one would ex- 
pect from a group of experienced educators who have engaged in 
important educational projects. 

Besides the initial chapters on present evaluation and the
definition of evaluation, there are chapters on educational decision
making, criteria of evaluation, administrative problems, evaluation
methodology, evaluation types, strategy, organization and
administration of evaluation units, education of evaluators, and a final
“overview” chapter.

Careful reading of the book should give to anyone interested in
educational projects much insight into the complexities involved,
the things to be thought of, the need for careful planning,
including the need for communication at different levels with groups of
persons concerned: parents, community, pupils, administrators,
decision makers. Whether or not an “evaluator” reading the book may
attempt to incorporate in his own system much of its methodology,
there are likely to be some things of value that he could use.

It is to be hoped, however, that a person preparing to undertake
an evaluation will overlook some of the research viewpoints
presented in the book. An example of such a viewpoint was given in
the “no-significant difference syndrome.” The rigorous research
model, the authors point out, belongs in the laboratory, not in the
“real world” of the school and classroom. It might be pointed out
in contradiction that the classroom set-up is much more like a
laboratory than like the real world the classroom is presumably
helping the child adjust to. If success of a project implies success
in classroom procedures, then laboratory or “classical” research
design procedures are appropriate, since they, or less precise and
controlled versions of them, are similar to many regular classroom
procedures.

There is much objection to the randomization required in many
research designs for estimating the effects of a new method or
treatment. The implication is that such randomization is immoral
and unethical. Just why they believe it to be immoral or
unethical they do not state. Presumably it is because the potential
benefits of what is being tried out are withheld from the control
group. If that is the reason, it means that the experimenter is
prejudging the results of his new methodology before it is put to the
test. If the treatment being proposed actually is no more effective
than the pre-existing one, or is possibly less so, then the waste of
money and time, as well as other possible losses, may very likely
be obscured by the lack of an adequate evaluative design involving
randomization. Avoidance of such an adequate design, therefore,
is more vulnerable to the charge of immorality or unethicalness
than is its adoption.

In spite of the bias of the authors against “traditional” experi- 
mental design with its randomization requirement, the authors sug- 
gest a type of design in which children are assigned at random 
(this is one of the inconsistencies in the book) to experimental 
and control groups, but the teacher determines the treatment each 
child is to receive in accordance with the child’s individual pre- 
sumed needs. The measurements at the end of the cycle for each 
child then are categorized in terms of “success” or “failure.” If 
the experimental group then has significantly more “successes” 
than the control, the results indicate the experimental program was
superior to the control. Significance is judged with a “non-parametric”
test such as chi square. Since the measure of success will be
different for each child, success will have different and non-com- 
parable definitions. The set of criterion variables for one group 
will, therefore, be different from that of the other. Since these are 
largely determined by the teacher of the child, as they state, any 
difference could just as well be ascribed to teacher differences or 
interaction of teacher with the methods, as to the methods or pro- 
grams themselves (unless teachers are also randomly assigned). 
One could easily (but probably unconsciously) define success into 
a program by this method.
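
The significance test in such a design reduces to a chi square on a 2 x 2 table of group by outcome. The sketch below uses hypothetical counts and a hypothetical function name, applies no continuity correction, and obtains the p-value for one degree of freedom from the complementary error function; it is offered as an illustration of the computation, not as the authors' procedure.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi square (no continuity correction) for the table
           success  failure
    exp.      a        b
    control   c        d
    Returns (chi_square, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected frequency of each cell under independence of group and outcome.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 df, P(chi square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 30 of 50 experimental and 20 of 50 control
# children are classified as "successes".
chi2, p_value = chi_square_2x2(30, 20, 20, 30)
print(round(chi2, 2), round(p_value, 4))
```

The statistic is computed from the counts alone. It cannot indicate whether “success” was defined comparably for each child or each group, which is precisely the weakness described above.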

The statement that this general strategy can be applied to virtually
all known experimental designs is difficult to accept. How, using
chi square or other so-called non-parametric statistical tests, this
strategy can fit into many of the more complex (but important)
designs described by Lindquist, Winer and others is hard to see.

The predilection for non-parametric statistical tests also shows 
itself elsewhere in the book. One reason given for preferring them 
is that the evaluator “. . . can avoid the problem of statistical
assumptions by using non-parametric tests.” This statement is
simply not true. An important assumption, and one that is often
violated in using several non-parametric tests, is that of the
independence of the measures in different categories. This assumption
is particularly often violated in the use of the chi square test
in educational settings. It is not always easy to determine when this
assumption is met. Another assumption in chi square is that sample
frequencies are normally distributed around the theoretical
frequencies, and this is not always easy to determine either.
Randomization is also just as important in many non-parametric designs as
in parametric.

The authors criticize experimental designs using analysis of
variance not only because of the assumptions required but because
they so often show no significant differences. If non-parametric
tests were used for the same data, however, even fewer significant
differences would be found, because these tests are less powerful
and coarser instruments. Their more frequent use with smaller
numbers of cases only further decreases their power. Furthermore,
it has been demonstrated many times that the assumptions underlying
parametric tests are very flexible, so that it is not often
likely that a violation of one or more will be sufficient to
invalidate the test in educational experimentation.

Though the book is weak on the measurement and research design
function of evaluation, it has much to offer concerning the
organization and functioning of an educational project. Some of its
value in this respect, however, is obscured through the organization
of the book itself.

There is a good deal of overlapping and repetition; one reason
is that the book is cross-sectional rather than longitudinal in
approach. That is, different types and levels of evaluation are often
considered together in the same chapter, and frequently a
description or analysis peculiar to one type is inapplicable to
another. There would likely be less repetition and more clarity if,
after initial viewpoints, definitions, and other background materials
are taken care of, the different levels, from classroom to citywide
and regional evaluations, were individually followed through
with the methodology and complexity peculiar to each. As it is, it
is often not made clear which levels are considered pertinent to
the material presented.

Though four “decision settings,” two “criterion models,”
three “decision models,” four “decision types,” and four
“evaluation types” are given, one of which has two “modes” and another
three “strategies,” it is often not made clear how these interact. The
use of some 70 figures and charts does little, at best, to help
clarification. Nor does the application of technical-sounding
terminology, such as referring to the four decision settings as the
“metamorphic,” “neomobilistic,” “homeostatic,” and “incremental,” or
the borrowing of terminology from information theory, cybernetics,
computer language, and set theory, do much to provide precision and
clarification. For example, in referring to a system of educational
programs they state, “The two-stage system has the properties of both
the additive and multiplicative functions of set theory.” There is also
reference to “ratio of signal to noise,” “bits of information,” and the
“bandwidth” of project reports.

The authors believe that “new style” evaluators must be trained,
presumably in the theories and methodologies that they present.
They believe that the few existing trained evaluators, or those in
training, are mainly “traditional evaluators who have, if the present
analysis has any validity, largely outlived their usefulness.”
However, if any such evaluators, or others, turn to this book for
help in setting up an evaluation unit, they might heed these words
from the authors (p. 340):

“Anyone who searches these chapters for help in setting up his 
own evaluation unit will find only a few rules of thumb, some 
incomplete case descriptions that illustrate more the fact that units
can be established than how to go about doing it, and a few ex- 
hortations.” 

Anyone who reads this book carefully, however, will certainly 
be stimulated to think, to analyze, and to reorganize his thoughts 
about educational evaluation and decision-making, even though 
he may disagree with much that is said. 

CARL A. CLARK
Chicago State University 


Gene F. Summers (Ed.) Attitude Measurement. Chicago: Rand 
McNally, 1970. Pp. xviii + 568. $7.50 (paperback).


The present book edited by Summers is a collection of articles 
on major approaches to attitude measurement. It not only includes 
reports of studies and articles which have previously appeared in 
professional journals and books but also articles which were 
specifically prepared for this volume. The book is divided into six 
sections covering self-report, direct observation, indirect or disguised 
testing, and physiological techniques of measuring attitudes. Each 
section is preceded by an overview which serves as an introduction 
to what is to follow. 

The anthology was prepared with the view of alerting researchers 
to the dangers of using only self-report techniques and to the 
desirability of using multiple techniques of data collection. An
introduction is provided at the beginning with the aim of developing a
framework for the use of multiple measurements. The tone of the
book is set by the article of Cook and Selltiz in which five broad 
categories of measurement procedures are discussed. The remaining 
three articles in the first section deal with the problems of scaling, 
reliability, and validity of attitudinal measures. The second and 
third sections include material describing the use of self-report 
techniques such as equal-appearing intervals, summated ratings, 
scalogram analysis, and semantic differential and their modifications 
and evaluative comparisons with each other.

A discussion of indirect testing and performance on objective
tasks is presented through six papers in the fourth section. Kidder
and Campbell amplify issues and problems associated with
disguised testing such as invasion of privacy, deception, and other
ethical and moral considerations. They continue by discussing ways
of measuring attitudes in an indirect manner in order to
circumvent the familiar problems of social desirability, response sets,
acquiescence, wilful distortion of responses, etc., and conclude on
a pessimistic note about the usefulness of disguised measures. The
article by Proshansky explores the effectiveness of TAT-like
pictures for measuring attitudes toward labor, followed by three
articles on inferring attitudes from error and distortion in
perception and reasoning. The last article in this section deals with the
rating of plausibility of arguments in favour of or against an issue
as an indirect manifestation of attitudes.
The concern of the articles on inferring attitudes from direct
observation of behavior is that verbal attitudes rarely reflect the actual
behavior of respondents when they are faced with the real situation.
The first two articles in this section deal with the discrepancy
between racial attitudes and racial behavior. Gage and Shimberg
have attempted to measure progressive attitudes of United States
senators from their voting behavior on bills in the Congress.
Campbell, Kruskal, and Wallace have reported on an index of seating
arrangements for measuring racial attitudes. The last article in
this section is by Tittle and Hill, who investigated the relative
efficiency of five self-report techniques (successive intervals,
summated ratings, scalogram analysis, semantic differential, and a
self-rating of attitude) for predicting actual behavior. They found
the Likert technique superior to all of the other methods. The final
section of the book contains six articles on the use of various
physiological responses as measures of attitudes. Five articles are
based on studies in which galvanic skin response (GSR) and pupil
response were used as measures of attitudes toward Negroes. The
last article, by Mueller, critically reviews research which has
employed physiological techniques for measuring attitudes.
This reviewer finds most of the discussion in the introductory
section of the book on definitions, data collection, scales, and
reliability and validity superfluous for the reason that the first
section includes two articles on these topics. This section would be
more useful if it provided a historical perspective on the emergence
of attitude as an important concept in social psychology and
thus our concern with its measurement. In addition, the part on
reliability and validity contains numerous incorrect statements.
For instance, error, if different from zero, has been considered to
be positive. The situation where an observed score may be an
underestimate of the true score has not been contemplated. However,
ignoring the typographical errors, the chapter by Bohrnstedt on
reliability and validity makes up for some of the inaccuracies in
the expository section. The criticism of the split-half method of
estimating the reliability of a measuring instrument in this chapter,
though, is unwarranted because, in practice, a test is never randomly
split into two halves since items in the two halves are usually
matched on relevant characteristics. The topic of validity of
measurements has not been treated in as much detail as the topic of
reliability. In the opinion of this reviewer, the discussion of
criterion-related validity is especially inadequate.
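
The point about the sign of measurement error can be illustrated with a
small simulation. This is only a sketch under the usual classical
assumption that an observed score is a true score plus a random error
centered on zero; the function name, the true score of 50, and the error
standard deviation of 5 are hypothetical and are not taken from the book
under review.

```python
import random

def simulate_observed_scores(true_score, error_sd, n, seed=0):
    """Draw n observed scores X = T + E, with E ~ Normal(0, error_sd)."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n)]

true_score = 50.0  # hypothetical true score T
observed = simulate_observed_scores(true_score, error_sd=5.0, n=1000)
errors = [x - true_score for x in observed]

n_under = sum(e < 0 for e in errors)  # observed score underestimates T
n_over = sum(e > 0 for e in errors)   # observed score overestimates T
mean_error = sum(errors) / len(errors)

print(n_under, n_over, round(mean_error, 3))
```

Under this assumption roughly half of the observed scores fall below the
true score, which is exactly the case the reviewer says the introductory
section fails to contemplate.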

The argument in favor of using indicators other than self- 
reports comes loud and clear in the book. The extent to which the 
argument is supported by evidence on the usefulness of these 
methods for assessing attitudes is open to question, however. The
basic premise seems to be that inferences based on unobtrusive 
measures of attitudes are more valid than inferences based on 
self-reports. In reading the reports of studies which have used dis- 
guised tests, one cannot help but feel that in the enthusiasm of 
using novel measures, numerous irrelevant sources of influence may 
have been created in the process of measurement. Moreover, 
scores on self-report questionnaires have been used as criteria for 
validation of the new approaches in most of these studies, which
weakens the argument about the superiority of these methods.
Perhaps it is worthwhile to mention that in assessing attitudes
toward sensitive concepts, self-report techniques may be more prone to
faking and social desirability than projective and physiological
measures. However, there is little evidence in terms of independent
criteria which makes one set of responses more valid as
expressions of attitudes than another set of responses.

The addition of the invited articles to the previously published 
papers is an extremely useful feature of the book. Most of the 
papers specifically written for this book bring together information 
on the use of various methods in attitude research and provide the 
reader with a critical review of the different techniques. The
thematic approach to the organization of the papers into sections
and the overview preceding each section are to be commended. In
conclusion, the book provides interesting reading for students in
courses on “Attitudes” and for practicing researchers who are
concerned with the assessment of attitudes.

S. B. KHAN
The Ontario Institute for Studies 
in Education 


Maurice M. Tatsuoka. Multivariate Analysis: Techniques for
Educational and Psychological Research. New York: John Wiley
& Sons, 1971. Pp. xiii + 310. $10.95


This book is a beauty! In stark contrast to so many technical 
statistical texts, this one was indeed written with the student in 
mind. For openers, the book includes: an accurate title; a reasonable,
concise statement of objectives and scope; a judicious selection
of topics, coherently organized and well integrated; an orthodox,
consistent notational system; and carefully worked numerical
examples for each of the principal methods which are discussed. 

In the first chapter, Tatsuoka notes that mathematical
prerequisites should include “one or two college mathematics courses
and an ability to follow mathematical thinking. . . .” (p. 3). As a
background in statistics, the author recommends having had a 
two-semester graduate course in applied statistics, with the usual 
fare offered in one of the standard texts in education or psychology. 
Although matrix algebra is extensively used, Tatsuoka includes two 
exceptional chapters on this topic which should assist much in 
acquiring the needed matrix facility. We suggest, however, that
a person who lacks systematic knowledge of vectors and matrices
would be well advised to study a book such as Searle’s Matrix
Algebra for the Biological Sciences (John Wiley & Sons, Inc., 1966) 
prior to earnest study of the present volume. 

After two chapters including an overview, the rudiments of 
matrix algebra and an abbreviated review of regression analysis, 
Tatsuoka tries gently to ease the student into multivariate
procedures and to hone his fledgling matrix skills with Chapter Three
on the analysis of covariance. This chapter would appear to serve 
its role well, but in recognition of the discontinuity between this 
and subsequent topics, Tatsuoka has made it possible to omit this 
chapter for those who want to proceed directly to real multivariate
analysis. 


Chapter Four is entitled Multivariate Significance Tests of Group
Differences. Many readers will especially appreciate the author’s
discussion of procedures for establishing confidence regions for the
population centroid when the population covariance matrix is known
and when it is estimated with one's sample; careful analogies are
drawn between the univariate and multivariate cases in this and all
subsequent chapters. Excellent use is made of figures and diagrams
toward the end of conveying an intuitive understanding of
the basic multivariate ideas. We believe that this pivotal chapter
is excellent; it might have been even better, however, had Tatsuoka
worked in at least some of the integrative tables and equations
of O. Porebski (“On the interrelated nature of the multivariate
statistics used in discriminatory analysis,” The British
Journal of Mathematical and Statistical Psychology, 1966, 19, 197-

Chapter Five takes the reader further into matrix theory; the 
expositions on linear transformations, axis rotation and eigen- 
values and vectors are among the finest to be found in any applied 
text. The frequent exercises have obviously been chosen with care; 
working them should provide constant insights into Tatsuoka’s 
narrative, especially since their solutions are supplied in an
Appendix. Brief attention is given to certain basic problems in factor
analysis, and principal component analysis is considered using R,
the matrix of product-moment correlation coefficients as a starting 
point. Some readers may find it at least a mild irritant to see 
phrases such as “reduced diagonal correlation matrix,” when re- 
ferring to common factor solutions, but this text has not been 
advertised as covering factor analysis so there seems to be little 
to complain about here. The concept of generalized inverses is 
briefly examined in a Theoretical Supplement to this chapter. For 
those important, but more-or-less ancillary, topics in matrix
algebra, the reader is provided with excellent Appendixes.

Chapter Six, Discriminant Analysis and Canonical Correlation, 
is most readable and accurate, as readers might well expect from 
this author. Cognoscenti will find that they have a most enjoyable 
piece of bed-time reading here, from which they may quickly learn 
that they were not truly cognoscenti after all. We wish there had 
been a little more guidance for interpretation of discriminant 
functions. In particular, the practice of correlating discriminant 
functions with the original variables or with the so-called “error” 
portions seems to us to have been worth discussion. As it is, the 
user is left to interpret vectors of (standardized) discriminant 
weights and difficulties with such interpretations are ignored. Major
attention is focused on significance testing, especially through
applications involving Wilks’ Λ-statistic.
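
For readers unfamiliar with the statistic, Wilks' Λ compares the
within-groups dispersion with the total dispersion: Λ = |W| / |W + B|,
where W and B are the within-groups and between-groups
sums-of-squares-and-cross-products matrices. The sketch below only
illustrates that ratio for two variables; the function names and the
numerical matrices are hypothetical and are not taken from Tatsuoka's
text.

```python
def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def wilks_lambda(within, between):
    """Wilks' Lambda = |W| / |W + B| for 2 x 2 SSCP matrices."""
    total = [[within[i][j] + between[i][j] for j in range(2)]
             for i in range(2)]
    return det2(within) / det2(total)

# Hypothetical SSCP matrices, chosen only for illustration.
W = [[10.0, 2.0], [2.0, 8.0]]
B = [[5.0, 1.0], [1.0, 4.0]]

lam = wilks_lambda(W, B)
print(round(lam, 4))
```

Values of Λ near 1 indicate little separation among group centroids;
values near 0 indicate that between-groups differences dominate.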

The penultimate chapter is entitled Multivariate Analysis of
Variance. The rudiments are nicely presented and the numerical
example is again used masterfully, including an integrated
discussion of discriminant function analysis and MANOVA. The
chapter is brief since Tatsuoka has chosen to cover only two-factor
fixed effects designs in any detail. Again the Λ-statistic is
most heavily used, but reference is made to the major competing
test criteria as well as to sources which cover them in more detail.

The last chapter, Eight, deals with Classification Problems,
starting with an insightful discussion of the concept of
“resemblance.” Again, a numerical example is used to great advantage.
Classification problems are approached on the assumption of
multivariate normal sampling. Various (modified) Chi-square statistics
are employed with and without account being taken of prior
probabilities of group membership. Wise counsel is to be found
regarding uses of these methods and we look forward to interesting
applications.

To summarize, this entire book stands as a monument to the
author's scholarship, and to his fine pedagogical work. It only seems
unfortunate to us that Tatsuoka decided to restrict substantially
his selection of topics, and thus to abbreviate this book. The
foregoing should be recognized for the compliment that it is; a standard
of pedagogical writing so fine has been set that Tatsuoka’s present
elucidation of multivariate methods may ultimately be seen as
less significant than the model he has provided for textbook
writing.

ROBERT K. GABLE
University of Connecticut


ROBERT M. PRUZEK
State University of New York
at Albany


Robert C. Tryon and Daniel E. Bailey. Cluster Analysis. New
York: McGraw-Hill, 1970. Pp. xvii + 347. $13.50 


This book, published three years after Tryon’s death, is a des- 
cription of the BC TRY programs written under his leadership, 
their rationale, and their application to three sets of data. Empirical 
clusters of persons or variables have an appealing simplicity of 
definition compared to the abstractness of the factors defined by 
more orthodox methods. Therefore, cluster analysis is a family 
of alternatives to factor analysis which merits consideration in 
many applications. Unfortunately, the contents of the book are 
confined to the particular methods included in the BC TRY system. 
Moreover, the partisan approach taken by the book may repel 
some readers while leading others to expect that they will get more 
from the programs than they actually receive. 

The BC TRY system consists of a substantial number of com- 
puter programs either directly involved in or auxiliary to factor- 
analysis types of purposes. Some are standard factor analytic or 
statistical procedures, but the emphasis in the book is naturally 
on those more narrowly related to cluster analysis as a distinct 
methodology. The principal procedures are one for finding what is 
called in the older factor analytic literature “group factors,” in- 
cluding a procedure for analytically defining the groups; and one 
for defining the factors according to predetermined clusters; the 
latter is much like a multiple group method when applied to
variables. Both kinds of procedures may be applied to correlations
among variables (V-analysis) or individuals (O-analysis),
although the approaches are somewhat different in the two cases.

The emphasis in the book seems more on what purportedly can 
be gotten by use of the programs than on the details of how they 
work. The reader is asked to take the methods on faith in the 
early chapters (e.g., p. 24), but the explanations when they are
finally given seem meager. For example, the measure of similarity 
of correlational profiles which is the cornerstone of one of the main 
methods is not explicitly defined until p. 291. It was stated earlier 
in the book that the index could be negative if profiles were
mirror images of each other, but all the terms in the formula are
squares, so the reader is still puzzled as to whether the formula 
is indeed the index. The greatest lack in this respect is the failure 
to present a clear development of the method for finding clusters, 
including the mathematical reasons why it must work, if it must, 
the circumstances under which it finds spurious clusters or pseudo- 
clusters when the points are not discontinuously distributed, its 
sensitivity to starting places or selections of variables, and the like. 
The reviewer was left with a distinct impression of a tendency to 
gloss over such problems. On page 304 the authors state “Successive
samples . . . of variables are taken in such a fashion that
V-analysis on these samples increasingly converge upon the re- 
sults of a hypothetical super V-analysis performed on all the 
variables,” but the only evidence that is presented is of an empirical 
nature, and even that is somewhat contradictory of the statement. 

The book is in other respects a good exposition of the methods, 
and the programs themselves seem to provide some rather unique 
approaches to the analysis of individual differences. The reasoning 
on which some of the details of the methods are based, however, 
is dubious in some instances, or at least inadequately presented. 
For example, the main criterion for deciding on the number of 
factors is the percentage of estimated communality explained, re- 
commending that factoring stop when 98 per cent of it is ac- 
counted for. Such a procedure puts heavy weight on having a 
method that produces communality estimates that are at least 
unbiased. Part of the basis for their recommendation of one par- 
ticular communality procedure is that the grand mean over var- 
iables and studies of communalities estimated by it agrees most 
closely with the mean of the means of ten other methods. A number 
of the decisions or inferences made about the examples also seem 
rather arbitrary.
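
The dependence of the 98 per cent rule on the communality estimates can
be made concrete with a short sketch. It is only an illustration of a
stopping rule of this general kind, not the BC TRY implementation; the
function name, the factor contributions, and the communality totals are
all hypothetical.

```python
def number_of_factors(factor_variances, total_communality, threshold=0.98):
    """Return how many factors are needed before the cumulative variance
    explained reaches `threshold` of the estimated total communality."""
    explained = 0.0
    for k, variance in enumerate(factor_variances, start=1):
        explained += variance
        if explained / total_communality >= threshold:
            return k
    return len(factor_variances)  # threshold never reached

# Hypothetical contributions of successive factors.
variances = [3.0, 1.5, 0.45, 0.04, 0.01]

k_unbiased = number_of_factors(variances, total_communality=5.0)
k_underestimate = number_of_factors(variances, total_communality=4.5)
k_overestimate = number_of_factors(variances, total_communality=5.3)
print(k_unbiased, k_underestimate, k_overestimate)
```

With these made-up numbers the rule stops at three factors when the
total communality is taken as 5.0, at two when it is underestimated as
4.5, and never reaches the threshold when it is overestimated as 5.3,
which is why biased communality estimates matter so much here.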

The authors attempt merely to give an intuitive understanding 
of the methodology, a strategy that can lead to inaccuracy. For 
example, they say that Pearson r is “an index of similarity of
ordering individuals,” which is true in a way, but if one wants 
one strictly related to ordering, he would use one of the rank-order 
correlations. Moreover, the relation between multiple correlation 
and their indices of subgroup homogeneity (p. 230) seems to be in- 
accurately stated, as are the relations among eta, r and R. At another 
point, it is implied that annual turnover rates can be compounded 
to give the number of individuals who remain in a cohort after a
number of years, i.e., independence is assumed. 
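
The compounding at issue can be written out explicitly. The sketch
below assumes, as the criticized passage implicitly does, that each
year's departures are independent of how long a person has already
stayed; the function name and the figures are hypothetical.

```python
def expected_remaining(cohort_size, annual_turnover, years):
    """Expected cohort members remaining if each year's departures are
    independent of past history (the assumption being questioned)."""
    return cohort_size * (1.0 - annual_turnover) ** years

remaining = expected_remaining(cohort_size=1000, annual_turnover=0.20,
                               years=3)
print(round(remaining, 1))
```

If turnover is in fact concentrated among a mobile minority while the
rest of the cohort rarely leaves, this calculation would understate the
number who remain, which is the force of the objection.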

The book does not attempt systematically to relate the methods
to other developments in psychometric theory. For example, the
procrustes methods and Jöreskog’s restricted maximum likelihood
procedures are not mentioned in the context of the preset cluster
procedure, nor is discriminant function mentioned in the context of
object analysis, and time-honored things like correction for
attenuation are not referred to by name. Greater integration of the
methods into other theoretical formulations would have added to the
book's worth.

One strategy which is described in great detail with several
examples is the determination of variable clusters followed by defining
object (person) “clusters” in terms of arbitrary bands of profiles on
the corresponding factor scores (one could argue with their remarks
on computing factor scores). There is a strong tendency for them to
reify the clusters so defined, treating them as distinct entities rather
than simply multivariately defined levels. A number of the substantive
inferences that are made about these “clusters” of individuals
would seem to be merely consequences of the fact that the
factors are correlated, or correlated with other variables, and that the
marginal distributions of the original variables influence the
distributions on the factors. On the other hand, such an analysis does
serve the purpose of reminding the more orthodoxly minded that
multivariate homoscedasticity often does not hold, and that
generalizations are consequently often too strong.

There is in the book a tendency to focus on the sample rather 
than on generalization from it, a tendency which one hopes is 
anachronistic. The magnitude of the internal consistency of scales
derived by clustering methods is called attention to without
concern for the fact that these clusters have been defined so as to
maximize internal consistency in this sample. Similar statements
are made about the homogeneity of the object clusters without
regard for the degree of capitalization on chance. Moreover, the
practice of presenting for comparison the highest univariate r,
the multiple R, and the homogeneity coefficients for the three
subgroups in which it was highest (rather than say its average over
all subgroups) verges on the misleading. In fact, some of the
inferences that are made seem to the reader to be rendered truly
[word illegible] by the amount of data-refinement that has preceded
them.

The presentation of the material is generally satisfactory from
an expository point of view, but there are some irritations, and a
book with this amount of statistical material is likely to contain
errors. The reviewer found the sub-heading system confusing and
the index skimpy, and Tables 7.3 and 7.4 seem to be intermixed.

In view of the fact that this book is likely to be the main re- 
ference source for the theory and methodology of the BC TRY 
system, it seems unfortunate that it is not up to the level of quality
one would expect from these authors.


NORMAN CLIFF
University of Southern California


Leona E. Tyler. Tests and Measurements. (2nd ed.) Englewood
Cliffs, N.J.: Prentice-Hall, Inc., 1971. Pp. xii + 99. $5.75 and
$2.75 (paperback). 


After an interval of seven years since the publication of the first 
edition of Tests and Measurements by Leona Tyler, she and the
publishers have brought out the second edition. Both editions 
are small books, the second running approximately 100 pages. 
The book is one of the Foundations of Modern Psychology Series 
produced by the publishers. Illustrative of the other outstanding 
books in this series are The Psychological Development of the 
Child by Paul Mussen, Personality by Richard S. Lazarus, and
Language and Thought by John B. Carroll. Given its place in this series,
then, one perceives that the book is written for students of
psychology primarily rather than for students whose aim is to become
_ school teachers. It is necessary to make this distinction because 
the content of a Tests and Measurements book for school teachers 
would include and emphasize some different subject content. 

While it is admittedly desirable for teachers to master the
content of a textbook such as this one, teachers further need to
understand and master certain skills related to objectives and the
measurement or appraisal of their attainment in school, the different
methods and purposes of evaluation in the classroom, and the
construction of evaluation instruments, for example. This is not said
in adverse criticism of Tests and Measurements. Its purpose and
scope are explicitly outlined in the first chapter, which is entitled
The Nature and Function of Measurement in Psychology. Further- 
more, in the final chapter, Applications of Tests and Measurements, 
the areas covered are (1) tests and decisions and (2) tests as 
research tools. Nothing is said about tests in the classroom. Ob- 
viously, then, the book is not written to supply the need for 
teachers to understand test construction, selection, administration, 
and interpretation. 

The book is written most lucidly and interestingly, as though the 
author were standing before a class in lecture sessions. Questions 
from the listeners are anticipated and answered. Relationships of 
the subject material to applications and uses are properly made. The 
historical backgrounds of the major topics are described and serve
to enhance the reader’s interest. Illustrations of such historical 
material are found in the chapters on statistics, intelligence tests, 
and personality tests. Some of this historical material is new, i.e., 
not found in other readily available books on the subject. 

The chapter on statistics, and there should be one in every book 
on tests and measurements, is clearly and well written. It includes 
three tables and five figures, all appropriately selected. There is 
nothing about non-parametric statistics in the chapter and in a 
small book one would not be apt to expect such material, yet there
is a need for students in psychology to understand this area of
statistics.

Three chapters, one dealing with tests of intelligence, one with 
tests of special ability, and one with personality assessment, en- 
compass the standardized test area. Each chapter defines the con- 
cept, describes and illustrates some of the typical and best (valid 
and reliable) instruments employed in the field, discusses some of
the research which is pertinent, and then directs the reader's
attention toward the future or expected progress in the area.

The author brings the student-reader up-to-date in her discussion
of testing or assessment in these three chapters. Particularly
current and scarcely controversial is the material on (1) the IQ, (2)
a test’s fairness, (3) assessment of the total individual, and (4)
proper interpretation and use of the results of testing. The reader
becomes aware of the author's awareness of the criticisms of tests
and measurements through her diplomatic suggestions designed to
avoid a confrontation on the issues. One leaves the reading of this
book with the impression that test users have been misusing and
misinterpreting tests, particularly in employing measurement
instruments to accomplish goals for which these instruments are
invalid.


GLENN W. DURFLINGER 
University of California at Santa Barbara (Emeritus) 




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


Editor: W. Scott Gehman
Managing Editor: Geraldine R. Thomas 


BOARD OF COOPERATING EDITORS 


Dorothy C. Adkins, University of Hawaii
Lewis R. Aiken, Jr., Guilford College
Harold P. Bechtoldt, The University of Iowa
William V. Clemans, Science Research Associates, Inc.
Louis D. Cohen, University of Florida
Junius A. Davis, Educational Testing Service
Harold A. Edgerton, Performance Research, Inc.
Max D. Engelhart, Duke University
Gene V Glass, University of Colorado
E. B. Greene, Chrysler Corporation (Retired)
J. P. Guilford, University of Southern California, Los Angeles
John A. Hornaday, Babson College
John E. Horrocks, The Ohio State University
Cyril J. Hoyt, University of Minnesota
Milton D. Jacobson, University of Virginia
Joseph C. Johnson II, University of Connecticut
William G. Katzenmeyer, Duke University
E. F. Lindquist, State University of Iowa
Frederic M. Lord, Educational Testing Service
Авотв Lusty, Temple University 
Louis L. McQuitty, University of Miami, Coral Gables
William B. Michael, University of Southern California, Los Angeles
Howard G. Miller, North Carolina State University at Raleigh
Henry Moughamian, City Colleges of Chicago
Ellis B. Page, The University of Connecticut
Nambury S. Raju, Science Research Associates, Inc.
Ben H. Romine, Jr., University of North Carolina at Charlotte
Kendon Smith, The University of North Carolina at Greensboro
Thelma G. Thurstone, University of North Carolina at Chapel Hill
Herbert A. Toops, The Ohio State University
Willard G. Warrington, Michigan State University
John E. Williams, Wake Forest University
E. G. Williamson, University of Minnesota


VOLUME THIRTY-TWO, NUMBER TWO, SUMMER 1972 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT
1972, 32, 235-248.


EVALUATING THE TEACHING OF INTELLIGENCE¹


PAUL I. JACOBS²


AND 
MARY VANDEVENTER²


Division of Psychological Studies 
Educational Testing Service 


For many years psychologists have been intrigued by the
question of the relative importance of heredity and environment
in determining differences in intelligence. Their major research
method has been the correlation of intelligence test scores of
people with varying degrees of genetic similarity. Thus, for
example, they have shown that heredity is important because
monozygotic twins, who are genetically identical, are more alike
in their intelligence test scores than are dizygotic twins, whose
degree of genetic similarity is only that of ordinary siblings.
The users of this research method have generally been content to 
define intelligence as “the ability measured by intelligence tests.” 
Recently some psychologists have taken a different approach 
to the heredity-environment question. They argue that if
environment is important, then it should be possible to create
environments that foster intellectual growth. With this approach
more care must be taken with the definition of intelligence. In
a trivial sense, for example, one could increase a person's
intelligence by giving him practice answering those very questions
that comprise the test one uses to measure intelligence. Clearly,
then, more meaningful research must take into account the
relation between the training operations and the operations used to
measure intelligence. This paper suggests a rigorous procedure
for doing so.

¹ This research was supported by the National Institute of Child Health and
Human Development, under Research Grant 1 P01 HD01762.
² Now at Ferkauf Graduate School, Yeshiva University.


The Teaching of General Cognitive Skills 


Let us postpone temporarily the definition of intelligence and 
examine the methodology of some studies that have attempted to 
teach what we may call general cognitive skills. We will use the 
framework provided by Campbell and Stanley (1963), who dis- 
tinguish between the internal and external validity of an instruc- 
tional research study. 

Internal validity is concerned with whether the effects of the 
instructional treatment can be assessed unambiguously. There 
are eight classes of variables, which, if not controlled, would 
confound the effect of the treatment: history, selection, matura- 
tion, selection-maturation interaction, testing, instrumentation, re- 
gression, and mortality. The classic studies of Thorndike and 
Woodworth (1901) that tried to increase attention, observa- 
tion, and discrimination abilities employed what Campbell and 
Stanley call a One-Group Pretest-Posttest Design. This design 
leaves uncontrolled at least five of these classes of variables. 
We need not, therefore, be unduly discouraged about the pos- 
sibility of teaching general cognitive skills by Woodworth’s con- 
clusion: “The general result was that improvement . . . though 
often present was irregular and undependable" (1938, p. 194). 
More recent studies (e.g., Anderson, 1965; Wittrock, 1967) have 
demonstrated that young children can learn and transfer gen- 
eral problem-solving skills and strategies. These studies employed 
a Pretest-Posttest, Control Group Design that controlled all of the 
eight classes of potentially confounding variables. 

External validity is concerned with the generality of findings. 
Four factors can jeopardize external validity: interaction of testing 
and treatment, interaction of selection and treatment, reactive 
arrangements, and multiple treatment interference. The first three 
of these may have played a role in the Thorndike and Woodworth 
studies. 

But one cannot obtain external validity merely by avoiding 
specific dangers in design or procedure. On the positive side one 
must specify the universe to which one wishes to generalize and 
then adequately sample from it (Bracht & Glass, 1968). Typically, 
one’s interest transcends the particular subjects, experimental 
treatments, and criterion measures used. There may be severe 
practical problems in specifying and sampling from the popula- 
tions of subjects and of instructional treatments in which one is 
interested. But the greatest problem is specifying and sampling 
from a universe of instructional outcomes to determine the trans- 
fer value of the instruction. 

Often one does not rigorously define a universe of outcomes, 
but alters some details of the training task in an intuitively 
compelling way to test for "transfer." Thus Thorndike and 
Woodworth (1901) attempted to increase Ss’ skill in estimating 
areas by giving them practice in estimating rectangular areas 
and then tested them on estimating triangular areas. They tried 
to increase Ss’ skill in cancellation by giving them practice in 
cancelling words containing the letters e and s and then tested 
them in cancelling words containing the letters i and t. The more 
recent studies have also been interested in the transfer aspect 
of external validity. Jacobs and Vandeventer (1971a), for ex- 
ample, taught first-graders to solve double-classification problems 
based upon rules of color and shape and found transfer to prob- 
lems based upon rules of size and shading. 

One can delimit the extent of generalizability by broadening 
the range of transfer situations. Anderson (1965), for example, 
taught first-graders to solve problems by varying each factor in 
succession while holding all other factors constant. The training 
task involved cowboys who could be with or without hats, 
guns, and horses. He measured transfer to problems with the 
same logical structure but different specifics of content. Experi- 
mental Ss outperformed control Ss on a suburbia task in which 
houses could be one or two stories high, on the north or south 
side of the street, with or without trees, and with or without cars, 
and on a pencil task in which pencils could be long, medium, or 
short, with or without an eraser, and sharpened or unsharpened, 
but not on a pendulum task in which pendulums were of dif- 
ferent lengths, had different weight bobs, and were swung 
through different angles. From these results it is clear that trans- 
fer has not been obtained to the entire range of problems with the 
same logical structure. But since no explicitly defined universe 
of tasks has been sampled, it is not clear, aside from the specific 
instances tested, for what content one should or should not expect 
transfer. Anderson offers some speculation on this point. 


The Teaching of Intelligence 


Suppose one wanted to teach “intelligence.” We will define 
this to mean teaching the skills that enable a person to handle 
a certain set of tasks. Guilford’s (1967) Structure-of-Intellect 
model could be used to define a relevant universe of tasks. The 
model posits 120 categories of intelligence, each involving one of 
five operations (Cognition, Memory, Divergent Production, 
Convergent Production, or Evaluation), one of four contents 
(Figural, Symbolic, Semantic, or Behavioral), and one of six 
products (Units, Classes, Relations, Systems, Transformations, or 
Implications). If tests existed for each of the 120 categories, E 
would be in a good position. He might first choose at random one 
category of intelligence to teach. After Ss were taught, say, 
Convergent Production of Figural Systems, E might assess transfer 
to Convergent Production of Figural Products of all six kinds, 
or to Convergent Production of Systems with all four kinds of 
content, or to all five kinds of Operations with Figural Systems, 
or, through sampling, to all 120 categories of intelligence. 
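
The counting in this paragraph can be sketched in a few lines of Python. The facet names are taken from the text; the variable names and the choice of Convergent Production of Figural Systems as the taught category simply follow the example above and are otherwise illustrative.

```python
from itertools import product

OPERATIONS = ["Cognition", "Memory", "Divergent Production",
              "Convergent Production", "Evaluation"]
CONTENTS = ["Figural", "Symbolic", "Semantic", "Behavioral"]
PRODUCTS = ["Units", "Classes", "Relations", "Systems",
            "Transformations", "Implications"]

# Each category is one (operation, content, product) triple.
categories = list(product(OPERATIONS, CONTENTS, PRODUCTS))

# The category taught in the example from the text.
taught = ("Convergent Production", "Figural", "Systems")

# Transfer sets obtained by letting one facet vary while fixing the other two.
across_products = [c for c in categories
                   if c[0] == taught[0] and c[1] == taught[1]]
across_contents = [c for c in categories
                   if c[0] == taught[0] and c[2] == taught[2]]
across_operations = [c for c in categories
                     if c[1] == taught[1] and c[2] == taught[2]]

print(len(categories), len(across_products),
      len(across_contents), len(across_operations))
```

Each of the three transfer sets includes the taught category itself, so the number of genuinely new categories in each is one fewer than the count printed.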

There are, however, three difficulties with the use of the Struc- 
ture-of-Intellect (SI) model as a basis of assessing transfer from 
attempts to teach “intelligence” as defined above: 

1. Tests do not exist for all 120 categories. 

2. There is no evidence that different investigators would agree 
with Guilford’s placement of specific tests in specific cells. One 
critic has stated that “. . . the validity of the SI model is in 
the eye of the beholder” (Carroll, 1968). 

3. The SI model may be too general given our present ability to 
devise teaching techniques. It is unlikely that we know how to 
teach one category of SI in a way that will transfer across op- 
erations, contents, or products. Guilford (1967, p. 475) himself 
hedges on this point. Our knowledge of how to teach intelligence 
is minimal and, indeed, the idea of teaching intelligence is fre- 
quently met with skepticism. It would seem reasonable, there- 
fore, to adopt a conservative strategy. Rather than teaching 
within one category of the SI model and testing for transfer 
to other categories, we should teach and test for transfer within 
the same category. 

The category cognition of figural relations (CFR) is admirably 
suited for our purpose. CFR items make up in part or in full 
a great many intelligence and general aptitude tests and involve 
“. . . the kind of product—relation—on which Spearman 
appeared to stake everything pertaining to g and its dominant 
role in intelligence” (Guilford, 1967, p. 85). These items also fit 
Guttman’s definition of intelligence: “An act of a subject is in- 
telligent to the extent to which it is classified by a tester as 
demonstrating a correct perception of an unexhibited logical as- 
pect of a relation intended by the tester, on the basis of another 
exhibited logical aspect of that relation that is correctly per- 
ceived by the subject” (1965, p. 168). 

Figure 1A provides an example of a CFR item. Abstract geometric 
forms are arranged in the cells of a matrix. S is asked 
what belongs in the empty cell in the lower right-hand corner. 
To solve the item, he must induce the following relations: (1) 
shape changes from column to column, and (2) shading changes 
from row to row. He must then combine these relations to arrive 
at the answer. 

Suppose an S, who initially could not solve Matrix A, was 
taught to do so. He could show transfer by now being able to 
solve other CFR items that formerly he could not. But how 
different should these other items be? Matrices B, C, D, E, and F 
are items in the same format as Matrix A, and they also involve 
shape and shading relations. They differ among themselves in 
the number of cells they have in common with Matrix A. The 
number of common elements may be a useful indicant of 
similarity in studying transfer (Ellis, 1965). We can thus say that 
Matrix B, which has six elements in common with Matrix A, is 
more similar to Matrix A than is Matrix E, which has only four 
elements in common with Matrix A. Among Matrices B through F 
the one matrix on which S could demonstrate maximum transfer 
is Matrix F, which has no elements in common with Matrix A and 
is therefore most dissimilar to it. Implicit in this discussion is 
the assumption that S can discriminate among the elements. If he 
cannot, his apparent failure at the task may be attributed to 
perceptual confusion rather than lack of transfer. 

Even S's successful performance on Matrix F following instruction 
on Matrix A would still indicate a relatively low level of 
transfer, because the same shape and shading relations are 
involved. S could show higher levels of transfer by successful 
performance on Matrix G, which involves one new relation (based 
on size), Matrix H, which involves two new relations, Matrix I, 
which involves three relations (two old and one new), and 
Matrix J, which involves different relations for different 
elements within a cell. A still higher (and perhaps unlikely to be 
reached) level of transfer would be indicated by competency on 
Matrix K, which, by using words rather than geometric shapes 
as elements, involves what Guilford would call the cognition of 
symbolic relations (upper element in each cell) as well as the 
cognition of semantic relations (lower element in each cell). 
We could thus answer the question of whether instruction on 
Matrix A has increased a subject’s intelligence by finding out 
how well he can infer the relations contained in Matrices A, 
G, H, etc. 


The Content of Intelligence 


But our answer would only be meaningful if Matrices A, G, H, 
etc. covered some finite set of relations that constituted the 
“content” of intelligence. Test constructors have generally neglected 
the question of whether there is such a set; Guilford says only 
that “The number of possible relations is very great" (1967, 
p. 242). In the absence of explicit agreement on this matter, we 
can note the extent of de facto agreement by examining what 
relations existing intelligence tests make use of. 

To this end we examined the CFR items in 166 group intel- 
ligence tests listed in the Sixth Mental Measurements Yearbook 
(Buros, 1965) as well as 35 other tests in Educational Testing 
Service's test collection. For our purposes, we excluded items in 
which the relation involved meaning (e.g., hand and glove), and 
items involving only a one-dimensional trend (e.g., the Series 
Section of Cattell’s Culture Free Test). We further restricted 
ourselves to those tests containing 10 or more CFR items meeting 
our specifications. This left us with a universe of 1335 items in 
22 tests. 

A casual inspection of the 1335 items suggested that far fewer 
than 1335 different relations were involved. We have already 
seen, for example, that the category shading can be used to cover 
the relation between rows for Matrices A, E, and I of Figure 1, 
although the specific elements and kinds of shading differ among 
the matrices. We sought therefore to define a small set of such 
categories of relations that would cover the majority of the 1335 
items. 

In general we sought categories that would be easily under- 
stood and intuitively compelling. By way of contrast, Evans 
(1964) has taken a mathematically more rigorous approach to 
this problem. Our approach led us to accept categories that are 
not always mutually exclusive. For example, some matrices could 
be said to involve either movement in a plane or flip-over. 
Movement in a plane involves a figure changing its position in the 
cell as if it were physically slid around on the matrix without 
ever being lifted off the surface (Figure 2A). Flip-over involves 
a figure, initially lying face-up on the matrix, being turned 
over and replaced face-down (Figure 2B). Matrix C of Figure 2 
could be said to involve either movement in a plane or flip-over. 

At first, we derived a set of 13 relations which we believed could 
be used to classify the vast majority of the items. We wrote a 
manual to explicate these relations as they apply to items in 
matrix format. The manual went through several stages of 
revision based upon tryout with nine colleagues. We sharpened 
our definitions of the relations and reduced the set to 12. A sum- 
mary of these relations is presented in Figure 3. 

Figure 2. The contrast between movement in a plane and flip-over. 

The final form of the manual has been deposited with the National 
Auxiliary Publications Service. Order Document No. NAPS01783 from the 
National Auxiliary Publications Service of the American Society for 
Information Services, c/o CCM Information Sciences, Inc., 909 3rd Avenue, 
New York, N.Y. 10022. Remit in advance $5.00 for photocopies or $2.00 for 
microfiche and make checks payable to: Research and Microfilm Publication, Inc. 

[Figure 3: summary of the relations. Only the following entries are 
recoverable.] 

Shape: complete change of form, or systematic change, as from 
  solid to dotted lines 
[label not recoverable]: proportionate change, as in photographic 
  enlargement 
Flip-over: figure moves as if lifted up and replaced face down 
Added element: a new element is introduced, or an old one removed 
Unique addition: unique elements are treated differently from common 
  elements, e.g., they are added while common elements cancel each 
  other out 
Elements of a set: each element appears three times in a 3 x 3 matrix 

The next steps were to determine how well the set of logical 
relations could be communicated to others and how well it 
applies to the universe of interest. A 22-item test was created by 
choosing at random one CFR item from each of our 22 source 
tests. This process was repeated three times without replace- 
ment to produce four randomly parallel alternate forms. The 
majority of these items were already in matrix form. Those that 
were not were put into matrix form. [An illustration of an item 
before and after conversion appeared here and is not recoverable.] 

The four tests were then used to check the communicability of 
the relations. The two authors jointly judged which of the 12 
relations applied to the items in the first test. A graduate stu- 
dent in psychology independently judged these items and then 
discussed with the authors alternate interpretations, difficulties, 
ambiguities, etc. The student's only introduction to the task 
was through reading the manual, to which he was allowed to re- 
fer while making the judgments. Each author and the graduate 
student then independently judged which of the 12 relations ap- 
plied to the items in the second, third and fourth tests. 

The results of the judgments are presented in Table 1. On all 
tests the judges showed substantial agreement. In all cases but 
one, two judges accepted the third judge's categorizations as rea- 
sonable alternatives, given the nonexclusive nature of the categories. 

This suggests that the set of categories of relations can be com- 
municated to others. Ideally, the communicability of concepts 
should be established by randomly selecting judges from the 
population with which the experimenter wishes to communicate 
(Brandt, 1968). In this case, where the population is hard to 
define, the reader may check the communicability of the manual 
for himself. 

TABLE 1 
Number of Items on Which Particular Kinds of Agreement or Disagreement Were 
Found Among the Three Judges 

                                    Second Test   Third Test   Fourth Test 
Three judges agreed completely.          13           14            12 
Two judges agreed that the 
  third judge’s categorizations 
  were acceptable alternatives.           8            7             9 
Three judges agreed item did 
  not fit set of categories.              0            1             1 
Disagreement among judges 
  as to acceptable alternatives.          1            0             0 


The data of Table 1 also show that the three judges agree only 
12 different relations are needed to cover 21 out of 22 of the 
items in each test. Let us consider the universe of items based 
upon all possible combinations of these 12 relations. We have 
shown there is de facto agreement among test constructors that 
within this universe lies almost all the “content” of the CFR 
category of intelligence. 
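
The arithmetic behind the "21 out of 22" figure can be checked directly from Table 1. The sketch below assumes one plausible reading, namely that an item counts as covered when the judges either agreed completely or accepted each other's categorizations as alternatives; the dictionary keys are our own labels.

```python
# Counts transcribed from Table 1 (items per test, by kind of agreement).
table1 = {
    "Second Test": {"complete": 13, "acceptable": 8, "no_fit": 0, "disagreement": 1},
    "Third Test":  {"complete": 14, "acceptable": 7, "no_fit": 1, "disagreement": 0},
    "Fourth Test": {"complete": 12, "acceptable": 9, "no_fit": 1, "disagreement": 0},
}

summary = {}
for test_name, row in table1.items():
    total = sum(row.values())
    # Assumed reading: covered = complete agreement + acceptable alternatives.
    covered = row["complete"] + row["acceptable"]
    summary[test_name] = (total, covered)
    print(f"{test_name}: {covered} of {total} items covered")
```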


Evaluating the Teaching of Intelligence 


A subuniverse consisting of all possible pairs of these rela- 
tions could serve as a meaningful universe of instructional out- 
comes for the potential trainer of CFR intelligence. It would be 
broad enough to reflect the collective judgment of test construc- 
tors as to the content of CFR. It would also be sufficiently de- 
limited (there are only 66 possible pairs) to permit assessment of 
transfer throughout the universe. The trainer might, for example, 
choose six pairs of relations to deal with in training and measure 
transfer to a large proportion (or perhaps all) of the other 60. 
He could then make a meaningful assertion about the effect 
of his training on CFR intelligence. 
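
The size of this subuniverse is easy to verify. In the sketch below the relation labels `R1` to `R12` are placeholders, and taking the first six pairs as the training set is an arbitrary illustrative choice, not a recommendation.

```python
from itertools import combinations
from math import comb

N_RELATIONS = 12
relations = [f"R{i}" for i in range(1, N_RELATIONS + 1)]  # placeholder labels

# All unordered pairs of distinct relations.
all_pairs = list(combinations(relations, 2))

# Arbitrary illustrative split: six pairs for training, the rest for transfer.
training_pairs = all_pairs[:6]
transfer_pairs = [pair for pair in all_pairs if pair not in training_pairs]

print(len(all_pairs), comb(N_RELATIONS, 2), len(transfer_pairs))
```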

In effect, then, we have taken what Guilford refers to as CFR 
to be the universe of outcomes in which we are interested and 
have given an internal structure to this universe by creating 
a micro-taxonomy of categories of relations. The taxonomy 
enables us to spell out the operations for assessing transfer of 
training within this domain. We suggest that the trainer who 
has produced transfer throughout this universe has increased 
CFR intelligence. In this way we have given a focus to 
investigators who want to demonstrate the importance of environment 
for intelligence. 


One problem an investigator who wanted to follow our ap- 
proach would face is that of developing criterion measures. While 
test constructors implicitly agree that our 12 relations constitute 
the content of CFR intelligence, this agreement does not extend 
to the relative importance of each of the 12 relations and 
combinations of them. Table 2 compares the combinations of 
relations present in two of our source tests: the widely used 
Lorge-Thorndike Intelligence Tests (Level 3, Section 3) and 
Raven’s Progressive Matrices (Sets B, C, D, and E). Every item 
has been tallied for each pair of relations it is based upon: an 
item based upon only two relations is tallied once; an item based 
upon three relations a, b, and c is tallied three times for the 
pairings ab, ac, and bc, etc. 
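
The tallying rule just described can be sketched as follows. The function name `tally_pairs` and the two example items are hypothetical; items based on a single relation produce no pair under this sketch, and how such items were tallied in the original table is not specified here.

```python
from collections import Counter
from itertools import combinations

def tally_pairs(items):
    """Count, over all items, each unordered pair of relations an item uses."""
    counts = Counter()
    for relations in items:
        # Sorting gives each pair one canonical order, e.g. (a, b) not (b, a).
        for pair in combinations(sorted(set(relations)), 2):
            counts[pair] += 1
    return counts

# Hypothetical items: one based on two relations, one based on three.
items = [("shape", "shading"), ("shape", "shading", "size")]
counts = tally_pairs(items)

# Share of the 66 possible pairs that these items touch.
coverage = len(counts) / 66
print(dict(counts), round(coverage, 3))
```

The two-relation item contributes one tally and the three-relation item contributes three, matching the rule in the text.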

TABLE 2 
The Frequency with Which Pairs of Relations Occur Within an Item in Raven's 
Progressive Matrices (Sets B, C, D and E) and the Lorge-Thorndike 
Intelligence Tests (Level 3, Section 3) 
[Table body not recoverable.] 

A striking finding is how poorly each test samples from the 
entire range of possible pairs. The 48 items of Progressive 
Matrices cover only 20 of the 66 possible pairs. Certain relations 
(addition and size) occur only in combination with themselves; 
one relation (reversal) does not occur at all. On the other hand, 
certain combinations are over-sampled (e.g., shape and elements 
of a set). The 24 items of the Lorge-Thorndike Intelligence Test 
cover only 14 different pairings of relations. Five relations 
(identity, elements of a set, addition, unique addition, and 
reversal) do not appear at all. It is clear a new test is needed that 
would adequately cover the CFR domain. One attempt to de- 
velop such a test and to use it to evaluate instruction aimed at 
increasing intelligence is presented elsewhere (Jacobs and Van- 
deventer, 1971b). 

Certainly alternate operational definitions of what it means 
to increase intelligence are possible, and one may argue that in- 
creasing CFR intelligence does not mean increasing general in- 
telligence. As Guilford put it: 


A final plea is that educational research, whether it is on 
intelligence or intelligence testing, should proceed on the basis 
of adequate psychological theory; a fruitful frame of reference, 
whether it is the structure-of-intellect theory or some better 
theory. Only in such an approach can research and its outcomes 
be rich with meaning, and have generalizable features (Guilford, 
1968, p. 27). 


Hopefully the use of formulations as explicit as the present one 
will lead to new insights regarding the relative importance of 
heredity and environment in intelligence. 
WHY IS A LONGER TEST USUALLY A MORE 
RELIABLE TEST? 


ROBERT L. EBEL 
Michigan State University 


One of the best known properties of the tests commonly used in 
educational and psychological measurement (i.e., those composed 
of tasks which sample some particular domain of tasks) is that 
the longer they are (i.e., the larger the sample of tasks), the 
more reliable are the scores they yield. This fact can be 
accounted for, loosely, by the general principle that large samples 
tend to yield more precise estimates of population parameters 
than do small samples. But a somewhat different and a somewhat 
more precise explanation can be given on the basis of these 
relations: 

1. The true component of a score is proportional to the number 
of equivalent elements that contribute to it. 

2. The error component of a score is proportional to the square 
root of the number of equivalent elements that contribute 
to it. 

The remainder of this paper will be devoted to supporting the 
credibility of these two propositions and to relating them to the 
Spearman-Brown formula. 


Credibility of the Propositions 

The usual mental test score is the sum of the scores on a 
number of items. Consider, for the sake of simplicity and clarity 
in the argument, a two-item test in which X + Y = Z. Each 
of these can be regarded as the sum of a true score and an error 
of measurement so that 

$$X = X_t + X_e, \qquad Y = Y_t + Y_e, \qquad Z = Z_t + Z_e,$$

so $Z_t = X_t + Y_t$ and $Z_e = X_e + Y_e$. 

Now what any component variable like $X_t$ or $Y_t$ contributes to a 
composite variable like $Z_t$ depends on its variability. The 
familiar formula for the variance of a sum, applied to this 
situation, gives 

$$\sigma_{Z_t}^2 = \sigma_{X_t}^2 + \sigma_{Y_t}^2 + 2 r_{X_t Y_t} \sigma_{X_t} \sigma_{Y_t} \qquad (1)$$

Since the true scores $X_t$ and $Y_t$ represent perfectly accurate 
measurements of whatever general characteristic the test as a whole 
measures (for whatever else these items measure is considered to be 
$X_e$ and $Y_e$) they should be perfectly correlated, so that 
$r_{X_t Y_t} = 1.00$. Then if we replace $\sigma_{X_t}^2$, $\sigma_{Y_t}^2$, 
and $\sigma_{X_t} \sigma_{Y_t}$ by their average value, represented 
here as $\sigma_{i_t}^2$, equation (1) becomes 

$$\sigma_{Z_t}^2 = 4 \sigma_{i_t}^2 \qquad (2)$$

or 

$$\sigma_{Z_t} = 2 \sigma_{i_t} \qquad (3)$$

Obviously if the same procedures were applied to a three-item test, 
the equation corresponding to (3) would be $\sigma_{Z_t} = 3 \sigma_{i_t}$, 
and for a 50-item test, $\sigma_{Z_t} = 50 \sigma_{i_t}$. Hence the 
generalization that the contribution (i.e., variability) of the true 
component in the total score is proportional to the number of elements 
(items) composing it. 

Perhaps it may be worth noting at this point that the true 
component of an item score in this context is not whatever the 
item measures consistently (i.e., reliably). It is rather what the 
item measures that all other items in a test also measure. Thus, 
if a test should include some items measuring verbal ability 
and other items measuring quantitative ability, the true 
component of the verbal item scores would not be whatever ability 
they measure consistently and the true component of the 
quantitative item scores whatever the quantitative items measure 
consistently. Instead it would be only the ability common to both 
verbal and quantitative; only the degree of success or lack of success 
that students tend to show on both kinds of items indiscriminately. 


Consider next the error component of the total score which, in 
the two-item case is 

$$Z_e = X_e + Y_e$$

The variance of these errors is given by the equation 
corresponding to (1) 

$$\sigma_{Z_e}^2 = \sigma_{X_e}^2 + \sigma_{Y_e}^2 + 2 r_{X_e Y_e} \sigma_{X_e} \sigma_{Y_e} \qquad (4)$$

But in this case, since the errors are random and uncorrelated, 
$r_{X_e Y_e} = 0$. If we replace $\sigma_{X_e}^2$ and $\sigma_{Y_e}^2$ by 
their average value, $\sigma_{i_e}^2$, equation (4) becomes 

$$\sigma_{Z_e}^2 = 2 \sigma_{i_e}^2 \qquad (5)$$

$$\sigma_{Z_e} = \sqrt{2}\, \sigma_{i_e} \qquad (6)$$

Again, for a three-item test equation (6) would become 
$\sigma_{Z_e} = \sqrt{3}\, \sigma_{i_e}$, and for a 50-item test 
$\sigma_{Z_e} = \sqrt{50}\, \sigma_{i_e}$. Thus, the contribution of the 
error component is proportional to the square root of the number 
of items contributing to it. 
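
Both propositions follow from the variance-of-a-sum formula once every item is given the same variance and every pair of items the same correlation. The sketch below makes that simplifying assumption explicit; the function name `variance_of_sum` and the unit item variance are our own illustrative choices.

```python
def variance_of_sum(n_items, item_variance, correlation):
    """Variance of a sum of n equally variable, equally correlated items."""
    # n variance terms plus n(n - 1) covariance terms of r * sigma^2 each.
    return (n_items * item_variance
            + n_items * (n_items - 1) * correlation * item_variance)

for n in (2, 3, 50):
    # True components: correlation 1, so the standard deviation grows as n.
    true_sd = variance_of_sum(n, 1.0, 1.0) ** 0.5
    # Error components: correlation 0, so it grows only as sqrt(n).
    error_sd = variance_of_sum(n, 1.0, 0.0) ** 0.5
    print(n, true_sd, round(error_sd, 3))
```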

In view of these relations, it is obvious that increasing test 
length increases the true score variance more rapidly than it 
increases the error variance. For example, suppose we start with a 
25-item test whose score variance is 10 and whose reliability is .50. 
This would mean that the true score variance is 5 ($r = \sigma_t^2 / \sigma_s^2$) 
and the error variance also 5 ($\sigma_s^2 = \sigma_t^2 + \sigma_e^2$), as shown in the 
first column of figures in Table 1. If the test is increased to 50 
items, the true score standard deviation is doubled, which means 
that the true score variance is increased four times. But the error 
variance is only doubled, since the error standard deviation is 
increased only by $\sqrt{2}$. Thus, the total variance ($\sigma_s^2 = \sigma_t^2 + \sigma_e^2$) 
is 20 + 10 = 30, and the reliability ($r = \sigma_t^2 / \sigma_s^2$) is 20/30 = .67. 
Other columns in the table may be derived similarly, given the 
number of items in the test. 


TABLE 1 
Effects of Increases in Test Length on Test Score Characteristics 

Test                    1      2      3      4 
Number of items        25     50    100    200 
Score variance         10     30    100    360 
True variance           5     20     80    320 
Error variance          5     10     20     40 
Score reliability     .50    .67    .80    .89 
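
The columns of Table 1 can be regenerated from the starting test alone. The following is a minimal sketch, assuming only the two propositions (true variance grows as the square of the lengthening factor, error variance grows in proportion to it); the function name `lengthen` is our own.

```python
def lengthen(true_var, error_var, k):
    """Variances and reliability of a test made k times as long."""
    new_true = k ** 2 * true_var   # true SD scales with k
    new_error = k * error_var      # error SD scales with sqrt(k)
    total = new_true + new_error
    return total, new_true, new_error, new_true / total

# Starting test from the text: 25 items, true variance 5, error variance 5.
base_items, base_true, base_error = 25, 5.0, 5.0

rows = [(base_items * k, *lengthen(base_true, base_error, k))
        for k in (1, 2, 4, 8)]

for n_items, total, true_v, error_v, reliability in rows:
    print(n_items, total, true_v, error_v, round(reliability, 2))
```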
Relation to the Spearman-Brown Formula 


It is interesting to note that the score reliabilities in the bottom 
row of Table 1 are exactly what would be obtained if the Spear- 
man-Brown formula were used to predict the effect of lengthen- 
ing the original 25-item test. This suggests that the Spearman- 
Brown formula might be derived on the basis of the relations 
stated at the start of this paper. 

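Reliability is defined here as the ratio of the true score variance 
to the total score variance, which is equal to the true score 
variance plus the error variance. Thus, the formulas for reliability 
of a test of unit length ($r_1$), and of a test n times as long ($r_n$) 
would be 

$$r_1 = \frac{\sigma_{1t}^2}{\sigma_{1t}^2 + \sigma_{1e}^2}, \qquad r_n = \frac{\sigma_{nt}^2}{\sigma_{nt}^2 + \sigma_{ne}^2}$$

The relations stated at the start of the paper, expressed 
algebraically, are 

$$\sigma_{nt} = n\, \sigma_{1t}, \qquad \sigma_{ne} = \sqrt{n}\, \sigma_{1e},$$

so that the reliability of the lengthened test, expressed in terms 
of the variances of the unit test would be: 

$$r_n = \frac{n^2 \sigma_{1t}^2}{n^2 \sigma_{1t}^2 + n \sigma_{1e}^2}$$

Dividing numerator and denominator of the right hand member of 
the equation by $\sigma_{1t}^2 + \sigma_{1e}^2$ gives 

$$r_n = \frac{\dfrac{n^2 \sigma_{1t}^2}{\sigma_{1t}^2 + \sigma_{1e}^2}}{\dfrac{n^2 \sigma_{1t}^2}{\sigma_{1t}^2 + \sigma_{1e}^2} + \dfrac{n \sigma_{1e}^2}{\sigma_{1t}^2 + \sigma_{1e}^2}}$$

Now $\sigma_{1e}^2 = \sigma_1^2 - \sigma_{1t}^2$ and $\sigma_{1t}^2 + \sigma_{1e}^2 = \sigma_1^2$. 

So 

$$\frac{\sigma_{1e}^2}{\sigma_{1t}^2 + \sigma_{1e}^2} = \frac{\sigma_1^2 - \sigma_{1t}^2}{\sigma_1^2} = 1 - r_1$$

Thus 

$$r_n = \frac{n^2 r_1}{n^2 r_1 + n (1 - r_1)} = \frac{n r_1}{n r_1 + 1 - r_1} = \frac{n r_1}{1 + (n - 1) r_1}$$

which is the Spearman-Brown formula. 
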
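
The equivalence derived above can also be checked numerically. The sketch below compares the Spearman-Brown prediction with the reliability computed directly from lengthened variances; the function names and the unit-test variances (5 and 5, as in the 25-item example) are illustrative choices.

```python
def spearman_brown(r1, n):
    """Predicted reliability of a test n times as long as the unit test."""
    return n * r1 / (1 + (n - 1) * r1)

def reliability_from_variances(true_var, error_var, n):
    """Reliability when true variance scales by n^2 and error variance by n."""
    return n ** 2 * true_var / (n ** 2 * true_var + n * error_var)

true_var, error_var = 5.0, 5.0
r1 = true_var / (true_var + error_var)

for n in (1, 2, 4, 8):
    predicted = spearman_brown(r1, n)
    direct = reliability_from_variances(true_var, error_var, n)
    print(n, round(predicted, 2), round(direct, 2))
```

The two columns agree for every lengthening factor, which is what the algebra requires.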
The relations here discussed between true scores and errors of 
measurement in mental test theory are closely analogous to the 
relations between signal and noise in communication theory. 

“. . . noise integrated over a certain length of time is 
proportional to the square root of such time, whereas the 
signal accumulates linearly with it” (Fox, Garbuny, and 
Hooke, 1963). 


Thus, a signal which would be lost in noise if sent only once may 
be received successfully if sent repeatedly and if the separate 
transmissions are suitably combined or integrated in the receiver. 

It is interesting to note, on the basis of these relations, the 
high ratio of error variance to true score variance in the score on 
a single test item. On a 100-item test with a reliability of .90, 
the true score variance is nine times the error variance. But, for 
a single typical item of that test, based on the relations described 
in this paper, the true score variance would be only .09 of the error 
variance. 
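
The single-item figure can be reproduced in two ways that should agree: by scaling the variance ratio down by the number of items, and by solving the Spearman-Brown formula for the unit test. The sketch below does both; the function name `unit_reliability` is our own.

```python
def unit_reliability(rn, n):
    """Spearman-Brown formula solved for the reliability of the unit test."""
    return rn / (n - (n - 1) * rn)

n_items, test_reliability = 100, 0.90

# Ratio of true to error variance for the whole test (about 9).
test_ratio = test_reliability / (1 - test_reliability)

# True variance scales with n^2 and error variance with n, so the ratio
# for a single item is the test ratio divided by n (about .09).
item_ratio = test_ratio / n_items

# Corresponding single-item reliability, computed two ways.
item_reliability = item_ratio / (1 + item_ratio)
item_reliability_sb = unit_reliability(test_reliability, n_items)

print(round(test_ratio, 2), round(item_ratio, 2), round(item_reliability, 3))
```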

The differential relations of true scores and errors of measurement 
to the number of items in a test were pointed out by Gulliksen 
(1950). Hence, this note does not claim to add anything new to test 
theory. Its justification must lie in the fact that many who are 
concerned with reliability are unfamiliar with the relationship, and 
that awareness of it can contribute substantially to an understand- 
ing of the concept of reliability. 
EFFECTS OF SOME VARIATIONS IN RATING SCALE 
CHARACTERISTICS ON THE MEANS AND 
RELIABILITIES OF RATINGS 


R. H. FINN 
University of Georgia 


The matter of rating scale construction and use has received 
a great deal of attention over the years, with an excellent sum- 
mary of pertinent work being provided by Guilford (1954, Chapter 
11). Certainly one of the more widely used types of rating scales is 
the simple graphic, ordinal type, and there seems to be an unlimited 
variety of ways in which the characteristics of such scales may be 
varied. Two obvious and popular ways in which scale characteristics 
may be varied are the number of scale levels and the manner with 
which scale levels are defined. Basic questions that are thus raised 
concern the effects of such variations on the means and the reliabil- 
ities of the resulting ratings. Several earlier studies seem to be espe- 
cially germane to these questions. Madden and Bourdon (1964) re- 
port a study of occupational evaluation where each of 15 occupations 
was rated on nine occupation-requirement factors with a scale of 
nine levels. The rating scale format, which was varied in seven 
ways, was defined as the physical arrangement in which the 
rating scale definitions and levels were presented to the rater. 
A three-way analysis of variance showed significant F values 
for all main effects as well as for all interactions. Rating scale 
format was the main effect of primary interest, and while the 
variations in format did result in statistically significant 
differences in mean ratings, the differences were so small as to 
appear trivial. Peters and McCormick (1966) studied the 
reliability of ratings with a numerically anchored scale compared 
to ratings with scales anchored by job-task statements. Subjects 
rated job-task statements on five dimensions of worker activity, 
using 7-point equal-appearing interval scales. The basic data 
suggested a general superiority of the job-task anchored scales 
but when reliabilities were corrected to estimate the reliability 
of т raters, this superiority disappeared. Miller (1956) examined 
a variety of published and unpublished research involving 
absolute judgment of simple, unidimensional stimuli (judgment 
of tones, loudness, taste intensity, etc.). He concluded that there 
is a clear and definite limit to the accuracy with which subjects 
can identify absolutely the magnitude of such stimuli, and that 
this span of absolute judgment is in the neighborhood of seven 
classes or alternatives. He makes the further observation that 
“it is interesting to consider that psychologists have been using 
seven-point rating scales for a long time on the intuitive basis 
(emphasis added) that trying to rate into finer categories does 
not really add much to the usefulness of the ratings (p. 84).” 
Symonds (1924) suggested that a seven level scale was optimum 
for rating personality traits, observing that a larger number of 
intervals does not yield an increase of reliability sufficiently 
great to make the increase worthwhile. From his own work, as 
well as that of others, he found that a reliability of about .55 
was a good average value for rating personality. He made the
premise that the difference between an obtained r and the true r 
must not be greater than an amount equal to a coefficient of 
alienation of .0213. Then using a formula for correcting a co- 
efficient of correlation for variations in the number of scale 
intervals, he showed that a reliability of .60 may be obtained 
with a 7-class rating scale with no more than the specified loss 
of reliability. 

These studies suggest that variations in the manner of defining 
scale levels have little effect on the resulting means and
reliabilities of judgments, but that the number of levels or
alternatives has an important bearing on reliability. The purpose of
this research was to further explore the effect of variations in 
these two scale dimensions (manner of defining scale levels and 
the number of scale levels) on the nature of the resulting ratings. 


Study I 
The first study, a preliminary one, was conducted to test the
hypothesis that the manner of defining scale levels does not affect
the means or the reliabilities of ratings. The manner of
defining scale levels refers to the type of definition provided, 
verbal or numerical. In this study, the number of scale levels 
was held constant. 


Procedure 


Thirty subjects (undergraduate students) were randomly assigned
to one of three experimental groups (Groups I, II, and III)
with each group having 10 subjects. Each subject rated 10 jobs 
(Job A through Job J) with reference to a five level job com- 
plexity factor. The 10 jobs were of a clerical nature and were 
described with one page job descriptions providing a job sum- 
mary and illustrative examples of the work. The order in which 
the jobs were presented for rating was randomized for each sub- 
ject. The levels of the job complexity factor constituted an 
ordinal scale. Definitions of the factor and three of the scale
levels are presented for illustration. 

Job Complexity: This factor refers to the variety of activi- 
ties and knowledge required, extent to which work procedures 
are routine and standardized, and length of the work cycle. 
This factor is not concerned with the difficulty of the job
activities, as such.

Level 1 (lowest level): Job duties and activities are standardized 
and routine in nature with a relatively short work cycle and a 
good deal of repetition of work activity. Procedures are well 
defined and few, if any, distinctly different activities are
involved. Deviations from routine do not require significant
independent judgment.

Level 3: Job duties and activities involve a considerable va- 
riety of tasks, or a few tasks which in themselves consist of a 
variety of elements or procedures. Work follows standard pro- 
cedures but judgment must be used in handling matters that 
deviate from the usual pattern. 

Level 5 (highest level): Job duties and activities are quite 
broad in scope and are prescribed in only general terms. An un- 
derstanding of interrelated activities and procedures of the Uni- 
versity is required. Standard procedures are not characteristic 
of this level. Considerable judgment must be used in analyzing 
and handling situations in terms of priority of importance and
impact of consequence.

Subjects in Group I used a rating scale with all five levels
verbally defined. Group II performed their ratings with levels
two and four defined verbally, the remaining levels being identified
only by number. The scale for Group III had none of the levels
defined verbally, except that level one was identified as the
lowest level and level five as the highest level. Otherwise, the
levels were defined only numerically.


Results


Means and variances of the ratings are presented in Table 1.


TABLE 1
Means and Variances for Treatment Combinations: Study I

                                      Jobs
Groups(a)        A    B    C    D    E    F    G    H    I    J   Pooled
Group I     X̄   1.8  2.3  3.4  1.5  1.4  2.2  3.9  4.4  4.0  3.1   2.8
            v   [row illegible in source]
Group II    X̄   1.7  2.6  3.5  1.8  1.7  2.2  4.1  4.3  4.3  3.4   2.9
            v   [row illegible in source]
Group III   X̄   1.8  2.6  3.2  1.5  1.7  2.4  3.8  4.3  3.6  3.6   2.8
            v   [row illegible in source]

(a) Group I (all levels defined).
    Group II (levels 2 and 4 defined).
    Group III (no levels defined).


The data were first analyzed with an analysis of variance
design appropriate for a two-factor experiment with repeated
measures on one of the factors (Winer, 1962, Chapter 7).
Results are shown in Table 2.


TABLE 2
Analysis of Variance: Study I

Source                       df      MS       F
Between Subjects
  Type Definition             2     .67       —
  Subjects W/Groups          27    1.41
Within Subjects
  Jobs                        9    [illegible] [illegible]
  J X TD                     18    [illegible] [illegible]
  J X Subjects W/Groups     243     .50

*p < .01.


The only significant F value was that associated with the
difference in mean ratings for Jobs, a difference which was
expected. The main effect of primary interest was the manner of
defining the scale levels, Type Definition, and this effect was not
significant. While the experimental design employed was not
ideal for studying this main effect because of possible
confounding with differences between Groups, it was felt unlikely that
these effects would cancel one another. Another important rea- 
son for using this particular design was that it lends itself well 
to use of the intra-class correlation as an estimate of inter-rater 
reliability. Intra-class reliability estimates are presented in Ta- 
ble 3. A second reliability estimate (Finn, 1970)1 is also pro- 
vided for comparative purposes. This latter approach provides 
an estimate of the reliability with which judgments are made 
concerning one or more items, and is an estimate of item re- 
liability rather than inter-rater reliability. This second type 
of estimate was utilized extensively in Study II where it was 
felt to be more appropriate than the intra-class approach. 
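
The text does not say how the corrected reliabilities reported next to the uncorrected ones were obtained. A minimal sketch, assuming the correction is the usual Spearman-Brown step-up from a single rater to the mean of k raters (the formula name and the function `spearman_brown` are assumptions, not taken from the source), is:

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Step up a single-rater reliability to the reliability of the mean of k raters."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1.0 + (k - 1) * r_single)


# Illustration with 10 raters per group, the group size used in Study I.
for r in (0.71, 0.64):
    print(f"uncorrected {r:.2f} -> corrected {spearman_brown(r, 10):.2f}")
```

Under that assumption, uncorrected values of .71 and .64 step up to about .96 and .95 for ten raters, which agrees with the corrected intra-class values listed for two of the groups in Table 3.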

All of the uncorrected reliabilities were significant at the .01 
level. Thus it is seen that the ratings were carried out with a high 
degree of reliability under each of the three treatments. The dif- 
ferences between reliabilities of the three Groups (Edwards, 
1960, pg. 83) were not significant. Since neither the means of 
the ratings nor the reliabilities of the ratings were affected by
the manner of defining the levels of the rating scale, it is
suggested that this is not an important consideration in rating scale
construction.


TABLE 3
Reliability Estimates: Study I

Groups               Intra-class     Finn
Group I
  Uncorrected            .71         [illegible]
  Corrected              .96         [illegible]
Group II
  Uncorrected            .74         [illegible]
  Corrected              .98         [illegible]
Group III
  Uncorrected            .64         [illegible]
  Corrected              .95         [illegible]


1 This approach is suggested where one wishes to estimate the reliability
with which a group of judges place items into scaled categories. The formula
is: r = 1.0 - [variance (observed)/variance (expected)] where the observed
variance is the within variance (of items) and the expected variance is the
theoretical within variance that one would expect if reliability were zero
(random placement of items).
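
The Finn (1970) estimate described in the footnote can be sketched in a few lines. The sketch assumes that the expected variance under random placement is the variance of a discrete uniform distribution over k scale levels, (k^2 - 1)/12; the footnote does not spell out that term, and the function names are hypothetical.

```python
def expected_variance(k: int) -> float:
    """Variance of ratings placed uniformly at random on categories 1..k (assumed chance model)."""
    if k < 2:
        raise ValueError("a rating scale needs at least two levels")
    return (k * k - 1) / 12.0


def finn_r(observed_within_variance: float, k: int) -> float:
    """r = 1.0 - observed within variance / expected (chance) variance."""
    return 1.0 - observed_within_variance / expected_variance(k)


# Hypothetical observed within-item variance of 1.0 on a five-level scale.
print(round(finn_r(1.0, 5), 2))
```

Because the expected variance grows with the number of levels, the same observed variance yields a higher estimate on a longer scale.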


Study II 


A second study was conducted to examine a greater variety 
of treatment conditions and combinations. The effects of pri- 
mary interest were variations in the manner of defining scale 
levels and the number of scale levels. 


Procedure 


Twenty subjects (graduate students) were randomly placed 
into four groups of five subjects each. Each subject rated four 
jobs (similar to those in Study I) on the same job complexity 
factor used in Study I. The experimental treatments were: 

Groups (G). Four groups (I, II, III, and IV) of five subjects each.

Jobs (J). Four jobs (K, L, M, and N).

Scale Levels (SL). The number of levels of the job complexity
factor was varied in four ways: three levels, five levels, seven
levels, and nine levels.

Type Definition (TD). The manner of defining scale levels was
varied in four ways:

(0)—None of the levels was verbally defined except that level
one was identified as lowest level and the other extreme as
highest level.

(3)—The middle level was defined verbally with the remaining
levels identified only by number (as above, the two extremes
were identified as low and high). The level definition appropriate
to level three in the original five level factor was used.

(1, 3)—The lowest and highest levels were defined verbally
with the intervening levels being identified by number.
Definitions used were those appropriate to levels one and three of the
original factor definition, and these were placed at the lowest
and highest levels respectively.

(1, 5)—This treatment was the same as the preceding one
except that level five of the original factor was used rather than
level three, thus creating a qualitative difference between these
two treatments.


TABLE 4
Experimental Design: Study II

                          Number of Scale Levels (SL)
Type Definition (TD)       3        5        7        9
None (0)                  I, L    II, N   III, K    IV, M
Midpoint (3)             IV, K   III, M    II, L     I, N
Extremes (1, 3)         III, N    IV, L     I, M    II, K
Extremes (1, 5)          II, M     I, K    IV, N   III, L

Note.—Cell entries are Groups I, II, III and IV and Jobs K, L, M and N. The first cell (1, 1)
finds Group I rating Job L with a scale of three levels with none of the levels defined verbally.


A Greco-Latin square design, as shown in Table 4, was
employed for the study.

All main effects were balanced in the design. The order in 
which subjects within a group were observed under treatment 
combinations was randomized for each subject. 
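
The balance of the design can be checked mechanically. The sketch below encodes the group and job assignments as they are read from Table 4 (several cells are hard to read in the scanned table, so the encoding is an assumption) and tests that groups and jobs each form a Latin square and that every group and job pair occurs exactly once. The function names are hypothetical.

```python
# Design as read from Table 4: rows are type definitions, columns are the
# 3-, 5-, 7- and 9-level scales; each cell is (group, job).
DESIGN = [
    [("I", "L"), ("II", "N"), ("III", "K"), ("IV", "M")],   # None (0)
    [("IV", "K"), ("III", "M"), ("II", "L"), ("I", "N")],   # Midpoint (3)
    [("III", "N"), ("IV", "L"), ("I", "M"), ("II", "K")],   # Extremes (1, 3)
    [("II", "M"), ("I", "K"), ("IV", "N"), ("III", "L")],   # Extremes (1, 5)
]


def is_latin(square):
    """True if every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok


def is_graeco_latin(design):
    """True if both layers are Latin squares and all (group, job) pairs are distinct."""
    groups = [[cell[0] for cell in row] for row in design]
    jobs = [[cell[1] for cell in row] for row in design]
    pairs = {cell for row in design for cell in row}
    return is_latin(groups) and is_latin(jobs) and len(pairs) == len(design) ** 2


print(is_graeco_latin(DESIGN))
```

A design that passes this check gives every group, job, scale length and type of definition equal representation with respect to the others, which is what the statement that all main effects were balanced requires.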


Results 


Means and variances for each treatment combination are 
shown in Table 5. 

The data were first analyzed by analysis of variance (Winer, 
1962, Chapter 10, Plan 7), and the results are presented in Table
6. 
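
As a reading aid for Table 6, the sketch below recomputes the within-subjects F ratios from the tabled mean squares. It assumes that each within-subjects effect is tested against the Error mean square, and the numbers are those read from the scanned table; the dictionary and function names are hypothetical.

```python
MS_ERROR = 1.49  # within-subjects Error mean square, as read from Table 6

within_subject_ms = {
    "No. Scale Levels": 49.55,
    "Type Definition": 0.85,
    "Jobs": 18.38,
}


def f_ratio(ms_effect: float, ms_error: float) -> float:
    """F ratio of an effect mean square to its error mean square."""
    if ms_error <= 0:
        raise ValueError("error mean square must be positive")
    return ms_effect / ms_error


for source, ms in within_subject_ms.items():
    print(f"{source}: F = {f_ratio(ms, MS_ERROR):.2f}")
```

The ratios for Number of Scale Levels and Jobs come out close to the tabled values of 33.25 and 12.33, and the ratio for Type Definition is below 1; small discrepancies presumably reflect rounding of the mean squares.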

TABLE 5
Means and Variances for Treatment Combinations: Study II

                              Number of Scale Levels
Type Definition             3       5       7       9     Pooled
None (0)           X̄       2.2     2.0     5.6     4.0     3.4
                   v       .70     .50     .80     6.0     2.0
Midpoint (3)       X̄   [illegible] [illegible] 5.6  4.2    3.3
                   v        .2      .2     1.8     3.7   [illegible]
Extremes (1, 3)    X̄       1.2     2.2     2.2     7.0   [illegible]
                   v   [row illegible in source]
Extremes (1, 5)    X̄       1.2     3.0     3.0     4.6   [illegible]
                   v        .2     1.0     1.5     5.3     2.0
Pooled             X̄   [row illegible in source]
                   v       .32     .60    1.32     4.0

Note.—Cell entries correspond to the design presented in Table 4.


TABLE 6
Analysis of Variance: Study II

Source                   df      MS        F
Between Subjects
  Groups                  3     5.28      2.9
  Subjects W/Groups      16     1.83
Within Subjects
  No. Scale Levels        3    49.55    33.25*
  Type Definition         3      .85       —
  Jobs                    3    18.38    12.33*
  Residual                3     3.48     2.38
  Error                  48     1.49

*p < .01.


The two treatments showing a significant effect were Jobs and
Number of Scale Levels. Variations in the manner of defining
the scale levels did not have an effect on mean ratings, nor did
there appear to be a significant interaction among treatments
(p > .05). The cell variances shown in Table 5 suggest that the
homogeneity assumption may not have been satisfied and that
a scale transformation may have been in order. A homogeneity of
variance test suggested by Cochran (Eisenhart, Hastay, and
Wallis, 1947, p. 388) was thus carried out. When based upon the
16 cells, the resulting C value of .239 was close to the critical
value (for p < .05) but did not exceed it. Using pooled variances
for the four main effects it was found that variances associated
with Number of Levels did not meet the homogeneity
assumption, but that pooled variances associated with the other three
main effects did satisfy the assumption. It was felt that the
departure from homogeneity was not sufficiently great to require
a scale transformation.
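
Cochran's statistic is the largest variance divided by the sum of all the variances being compared. A minimal sketch follows; the function name is hypothetical, and the input values are the pooled variances for the four scale lengths as read from Table 5.

```python
def cochran_c(variances):
    """Cochran's C: the largest variance divided by the sum of all variances."""
    variances = list(variances)
    if not variances or any(v < 0 for v in variances):
        raise ValueError("need a non-empty list of non-negative variances")
    total = sum(variances)
    if total == 0:
        raise ValueError("all variances are zero")
    return max(variances) / total


# Pooled variances for the 3-, 5-, 7- and 9-level scales, as read from Table 5.
pooled_by_levels = [0.32, 0.60, 1.32, 4.0]
print(round(cochran_c(pooled_by_levels), 3))
```

The resulting value is then compared with a tabled critical value for the given number of variances and their degrees of freedom; that lookup is not shown here.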

The results of Study II supported the hypothesis that mean 
ratings are not affected by the presence or absence of verbal
definitions of scale levels, a result compatible with that obtained
in Study I. It seemed especially noteworthy that subjects
responded to a scale defined only by numbers in the same way as
they did to other scales defined in part by verbal definitions.
Mean rating appeared to be proportional to the number of scale
levels but there were not sufficient data to justify fitting a curve.

The final part of the analysis concerned reliability estimates. 
The two treatments of primary interest were Number of Scale 
Levels and Type Definition. The approach previously referred
to (Finn, 1970), was used to estimate reliabilities. Table 7 
shows reliabilities associated with Number of Scale Levels. In
making these estimates, variances were pooled across treatments
for each of the four SL conditions.


TABLE 7
Reliability Estimates Corresponding to Number of Scale Levels: Study II

                      Number of Scale Levels
                      3       5       7       9
r (uncorrected)     .51*    .70**   .67**    .40
r (corrected)       .81     .90     .89

*p < .05.
**p < .01.

All of the reliabilities were significant except for that associated 
with the rating scale of nine levels. The differences among the 
other reliabilities were not significant. It is thus indicated that the 
subjects were not able to cope with the nine level scale, the num- 
ber of levels apparently being too great to permit reliable dis- 
criminations. Even though the differences among the remaining 
reliabilities were not significant, the data suggest that the five 
and seven level scales were optimal. The pooled variance for the 
three level scale was quite small (.32) and yet the resulting
reliability was significant only at the .05 level. It would appear
that three levels do not provide as much opportunity for varia- 
tion in judgment as appears desirable. 

Table 8 shows reliability estimates corresponding to Type 
Definition, and as before, variances used in making the estimates 
were pooled across treatments. Since the reliability for nine
levels of the SL treatment was not significant, this treatment 
condition was excluded in making the estimates in Table 8. 

All of the reliabilities were significant, with no significant dif- 
ferences among them. Again, it is interesting to note that the 
subjects utilized the scales defined only by numbers as well as
they did the scales defined in part by verbal statements. The
reliabilities were of respectable magnitude.


TABLE 8
Reliability Estimates Corresponding to Type of Definition: Study II

                       Type Definition(a)
                     0       3      1,3     1,5
r (uncorrected)    .70*    .67*    .68*    .60*
r (corrected)      .88     .86     .87     .82

(a) Excluding sub-treatments with nine scale levels.
*p < .05.


Discussion 


Thus, both the means of the ratings and the reliabilities of the 
ratings were affected by the number of scale levels but not
by the manner of defining the scale levels. It would appear then 
that the most critical consideration in rating scale construction 
is that of determining the appropriate number of rating scale 
levels. The optimum number in Study II turned out to be 
seven levels taking into account reliability of ratings and the 
desire to maximize variances of ratings. This result confirms 
our “intuition” as discussed by Miller (1956) and the hypothesis 
presented by Symonds (1924).

The finding that the manner of defining the levels of the 
rating scale did not affect either the means or the reliabilities 
of ratings, while differing in some respects with the studies previ- 
ously cited (Madden and Bourdon, 1964; Peters and McCormick, 
1966), seems in substantial agreement with their results. This 
aspect of rating scale construction would thus appear to be of 
little consequence. The ability of the subjects to use scales that 
provided no frame of reference other than numbers (and the 
identification of the extremes of the scale as low and high) was 
somewhat unexpected and invites conjecture. The subjects in the 
two studies reported on here represented diverse academic back- 
grounds, they were not trained in the use of rating scales, and 
they had been exposed only briefly, if at all, to the process of 
job evaluation. It could well be that the content of the rating 
task (i.e., the relative complexity of clerical type jobs) is of
general familiarity in our culture and individuals generally have
developed a common perspective of these jobs as well as similar 
standards which they apply in making judgments about such 
jobs. It would appear that the subjects brought to the rating 
task a set of preconceived and rather uniform judgment stand- 
ards which were independent of the standards provided them 
(i.e. the rating scale definitions). 
Klein and Hart (1968) investigated certain chance and systematic
factors affecting the grades assigned by 17 law school professors
to the answers of 79 law students and one layman to a
typical essay question in law school (this question is reproduced 
in Appendix A). These 79 students were enrolled in 16 different 
law schools (five per school) and all answered the common essay 
question as part of their regular final exam in Contracts. Xerox 
copies of the 79 answers to the common question were sent to 
each of the 17 professors for grading. 

On the basis of their study, Klein and Hart concluded: 

1. The level of agreement among the professors in how they
graded a single typical essay was high enough to expect
that law school grades are potentially quite predictable.
2. Although the basis for the professors’ agreement was not
clear, they did have a marked preference for longer answers
and for those written by brighter students. Since measures
of length and intellectual ability were unrelated, a
combination of them yielded a very high correlation with the
professors’ grades.
3. Individuals who had no law training generally assigned the
same grades to the paper as did the professors. It
appeared, therefore, that the professors were at least in part
giving good grades to those papers that presented a
persuasive “common sense” answer to the question.

__________
2 We thank the Law School Admissions Test Council for support of this
research, and the law professors who participated in it.
3 The professors were asked to grade the papers according to the following
instructions:
“Please read the papers in the order in which you receive them ... (each
P had a different order so that possible sequence effects were
counterbalanced). It is suggested, however, that you glance through a few of them
to get an idea as to their range of quality before you begin assigning
grades. Assign grades along a 5-point scale, i.e., 1 through 5. The higher
the number, the better the paper. Try to spread your grades across the
five categories as much as possible . . . (and) . . . each of the five grades
should be used at least twice.”


Other than length and handwriting quality, however, few 
clues were offered regarding the factors that affect law school 
professors in grading. The purpose of the present study was to 
examine the relationships between various characteristics of the 
essays and the grades assigned to them. It was hoped that in 
this way the factors that affect essay grades assigned by law 
school professors could be identified precisely. Such information, 
in turn, could be useful in designing tests which might improve 
the prediction of grades in law school as well as providing guide- 
lines for writing better essays. Alternatively, the results could 
suggest changes that should be made in grading practices. 


Procedure 


Two law professors did independent content analyses of the 
essays and each compiled a list of topics actually discussed by 
students. Differences between the lists were discussed by the two 
professors and a final list of 19 topics was compiled. These two 
professors with the assistance of two advanced law students then 
separated each paper by topic producing 607 answer parts. Each 
answer part consisted of all that one student wrote on a particu- 
lar topic. No student wrote on fewer than three nor more than 11 
of the 19 topics. 

Xerox copies of the typed answer parts were then sent to four 
law professors who were asked to grade the part answers on a
five-point scale using all five grades at least once whenever there
were eight or more statements on a given issue (see Appendix B
for the complete instructions). Each answer part was inde- 
pendently graded by two of the four graders. The answer parts 
were graded topic by topic, i.e., all the answer parts for one 
topic were graded before the grader began work on the next 
topic. Graders did not know from which paper an answer part 
came and since a separate numbering system was used for the 
answer parts under each topic, they could not piece the paper 
back together. Hence, any continuity or general halo factor 
that had existed in the original grading was lost. 

The essays also were rated for the use of legal terminology 
or jargon, the number of times a student cited an authority 
such as cases, statutes, etc., and the extent of the discussion of 
the citation. Within each of the major (defined below) issues 
the answer parts were scored according to whether a conclusion 
was reached (i.e., a decision for or against the plaintiff), the
strength of the argument supporting the conclusion reached, and 
whether arguments were presented for both sides of the con- 
clusion. The latter two scores were obtained by the ratings of 
two independent judges, a law professor and a recent law school 
graduate. 

In addition to the above content characteristics, scores were 
obtained on a number of variables which have been found to be 
related to “readability” (Chall, 1958). These included: average 
number of words per sentence, number of syllables per 100 words, 
number of prepositions per 100 words, and number of conjunctions 
per 100 words. A score also was obtained for the use of transi- 
tional phrases. This stylistic characteristic appeared relevant 
to the quality of the essay and was assessed by two independent 
raters.
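
A rough sketch of how some of the counts named above might be computed is given below. It is illustrative only: the word lists are small and hypothetical, sentence splitting is naive, and syllable counts are omitted because they need a pronunciation dictionary or a heuristic that the source does not describe.

```python
import re

# Small illustrative word lists; a real analysis would need fuller lists.
PREPOSITIONS = {"of", "in", "on", "for", "with", "by", "at", "from"}
CONJUNCTIONS = {"and", "but", "or", "nor", "because", "although", "if"}


def readability_counts(text: str) -> dict:
    """Average sentence length plus preposition and conjunction rates per 100 words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence")
    per_100 = 100.0 / len(words)
    return {
        "words_per_sentence": len(words) / len(sentences),
        "prepositions_per_100": per_100 * sum(w in PREPOSITIONS for w in words),
        "conjunctions_per_100": per_100 * sum(w in CONJUNCTIONS for w in words),
    }


sample = "Painter painted four houses for Landlover. He stopped work and left."
print(readability_counts(sample))
```

Expressing the counts per 100 words, as the study does, keeps the scores comparable across essays of different lengths.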

Several studies, such as the recent one by Marshall and Powers 
(1969), have indicated that composition factors are related to 
essay grades, A preliminary investigation of such factors with 
the present set of essays was done by Klein and Hart (1968). 
They found that global ratings of handwriting quality and Eng- 
lish composition correlated .25 and .44 respectively, with the 
mean grade assigned by the law professors. It was decided, 
however, that a more refined analysis was needed. To achieve 
this end, the following compositional characteristics were as- 
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sessed: (a) number of spelling errors, (b) number of grammatical 
errors, (c) number of punctuation errors, and (d) number of 
construction errors and inconsistencies in English usage (such 
as changing person or tense). These four error scores also were 
summed to provide a total composition score. All error scores 
were obtained by having two professional proofreaders examine 
the first typewritten page of each essay. Thus, the error scores 
are based on a uniform length. 

The importance of each of the 19 topics for answering the 
question was judged on a scale ranging from 0 to 5 by seven 
law school professors and these ratings were used to divide the 
issues into major and minor categories. 


Results and Discussion 


The issues and the sum of the ratings of the importance of the 
19 topics by the seven law school professors are presented in 
Table 1. Since the ratings were on a 0 to 5 scale the maximum 
sum from seven ratings was 35, a value which was achieved for 
issue Al. As can be seen, there were seven issues that had a rat- 
ing of 31 or higher and 12 issues that had a rating of 22 or less. 
For purposes of several of the following analyses, the former 
seven issues are classified as “major” and the latter 12 are re- 
ferred to as “minor.” 

In Table 2, the number of students who wrote on each issue
and the correlations between the ratings of the first and second 
rater of each issue are listed. These data also are reported for the 
mean rating on major and on minor issues and the sum of the 
ratings on major and on minor issues. There was generally rea- 
sonably good agreement between raters on the major issues: 
the correlations ranged from .51 for issue A2 to .76 for issue A4.
The interrater correlations were generally lower for the minor 
issues. The sum of the ratings on both major and on minor 
issues had high interrater correlations (r = .90 for major issues
and r = .83 for minor issues). 
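
The interrater agreement figures are ordinary product-moment correlations between the grades of the first and second rater on the same answer parts. A minimal sketch with made-up grades (the data and the function name are hypothetical) is:

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for both raters")
    return sxy / sqrt(sxx * syy)


# Hypothetical grades (1 to 5) given by two raters to the same eight answer parts.
first_rater = [1, 2, 2, 3, 4, 4, 5, 3]
second_rater = [2, 2, 1, 3, 5, 4, 4, 3]
print(round(pearson_r(first_rater, second_rater), 2))
```

With few answer parts per issue such a coefficient is unstable, which is consistent with the table's practice of not reporting correlations based on fewer than 10 cases.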

The means and standard deviations of the global ratings of the 
essays by 17 law school professors are presented in Table 3 
separately for students who wrote on each issue versus students 
who didn’t. As might be expected, the mean essay grade for 
students who wrote on any given major issue is larger than the
mean essay grade for students who didn’t write on that issue.
This pattern holds for only six of the 12 minor issues, however.
This difference is consistent with the distinction between major
and minor issues. In other words, students who identified and
wrote on the major issues got better grades.


TABLE 1
Ratings of Importance of the Issues by Seven Law Professors

Issue                                                                Rating
A1   Could Painter delegate his obligation to paint the houses?        35
A2   When Painter changed the color of the paint used on the job,
     was this a material breach of contract?                           34
A3   Effect of discontinuance of production of the particular shade
     of paint: Impossibility.                                          33
A4   Was Landlover's promise to pay an additional amount to
     Painter for painting the houses enforceable?                      33
A5   Did Painter anticipatorily breach the contract? Response of
     Landlover to Painter's actions.                                   32
A6   Divisibility of Contract: Painter’s duty to pay for each house
     as work was completed.                                            31
A7   Divisibility of Contract: Painter's right to recover for first
     four houses.                                                      31
B1   Time of the essence.                                              22
B2   Conclusion of final result that should be reached as to the
     rights of both parties.                                           22
B3   Discussion of fact that actual cost of paint to Painter was less
     than he charged Landlover: Fraud, etc.                            15
B4   Statute of Frauds.                                                13
B5   Problems in formation of contract: Offer, acceptance, unilateral
     or bilateral, etc.                                                10
B6   Tenants of buildings as third party beneficiaries.                [illegible]
B7   Discussion of damages.                                            [illegible]
B8   Landlover's rights and obligations in relation to XYZ Co.         [illegible]
B9   Landlover's rights and obligations in relation to XYZ.            [illegible]
B10  Introductory clauses unconnected with any specific issues of
     discussion.                                                       [illegible]
B11  Painter's rights and obligations in relation to XYZ Co.           [illegible]
B12  Miscellaneous—unclassified.                                       [illegible]

Table 4 lists the correlations of the global essay rating by the 
17 professors with both the number of words written on an issue 
and the sum of the ratings on an issue. The number of words 
written on an issue was positively correlated with the global 
rating for each of the seven major issues. With two exceptions, 
however, the number of words written on a minor issue was 
negatively correlated with the global rating. Thus, the more the 
student wrote on the major issues and limited his discussion to 
them, the better his grade. Further support for this argument 
is given by the substantial correlation of the global essay rating
with the proportion of total words written on major issues
(r = .64).


TABLE 2
Interrater Agreement: Correlations between Ratings of First and Second Rater on
Each Issue

Issue                                  N          Correlation
A1                                    73             .59
A2                                    67             .51
A3                                    66             [illegible]
A4                                    [illegible]    .76
A5                                    68             .55
A6                                    62             .59
A7                                    12             .52
B1                                    32             .49
B2                                    12             .29
B3                                    24             .33
B4                                    13             .30
B5                                    25             .28
B6                                     1              *
B7                                    20             .35
B8                                     9              *
B9                                     5              *
B10                                   31             .57
B11                                    4              *
B12                                   11             .26
Mean Rating on Major Issues           79             .80
Mean Rating on Minor Issues           79             .52
Sum of Ratings on Major Issues        79             .90
Sum of Ratings on Minor Issues        79             .83

* Not reported since based on less than 10 cases.

The sum of the two ratings on a major issue had a substantial 
correlation with the global essay rating in every case (the r's
ranged from .33 for issue A7 to .65 for issue A4). The correlations
of the ratings on minor issues with the global rating, with
the exception of B4 (r = .44) and B5 (r = .41), were generally
small. It appears, therefore, that professors were basing their 
overall grades more on the students’ coverage of the major issues 
than the minor ones. 

TABLE 3
Means and Standard Deviations of Global Ratings of Essay by 17 Professors for
Students Who Wrote on Each Issue vs. Those Who Didn’t

            Students Who             Students Who Didn’t
            Wrote on Issue           Write on Issue
Issue       N     Mean     SD        N     Mean     SD
A1         72     2.95     .91       7     1.67     .32
A2         66     2.91     .93      13     2.43     .90
A3         66     2.98     .92      13     2.09     .70
A4         76     2.88     .92       3     1.59      *
A5         68     2.94     .94      11     2.19     .64
A6         61     2.87     .93      18     2.70     .98
A7         12     3.11    1.02      67     2.78     .92
B1         33     3.01     .90      46     2.71     .96
B2         12     3.15     .93      67     2.78     .94
B3         23     2.94     .79      56     2.79    1.00
B4         13     2.43     .93      66     2.91     .92
B5         25     2.94    1.01      54     2.78     .91
B6          1     4.06      *       78     2.82     .94
B7         20     2.60     .95      59     2.91     .93
B8          9     2.77     .92      70     2.84     .95
B9          5     2.82    1.14      74     2.83     .93
B10        30     2.67    1.06      49     2.93     .85
B11         4     3.06      *       75     2.82     .96
B12        11     2.33     .88      68     2.92     .93

* Not reported since based on less than five cases.


TABLE 4
Correlations between Number of Words on Individual Issues and Ratings of
Individual Issues with Global Essay Rating by 17 Professors

                    Number of
Issue       N       Words on Issue     Issue Rating
A1         72           .19                .54
A2         66           .47                .47
A3         66           .41                .50
A4         76           .48                .65
A5         68           .39                .56
A6         61           .24                .59
A7         12           .10                .33
B1         33          -.13               -.02
B2         12           .80                .24
B3         23          -.05                .05
B4         13          -.40                .44
B5         25          -.15                .41
B6          1            *                  *
B7         20           .07               -.03
B8          9            *                  *
B9          5            *                  *
B10        30          -.13                .07
B11         4            *                  *
B12        11          -.30                [illegible]

* Not reported since based on less than 10 cases.


The intercorrelations of the various paper characteristics with
the professors’ mean global rating, the sums of ratings on
major and minor issues, law school GPA, and LSAT are reported
in Table 5. The three major issue scores are seen to predict the
global essay rating quite well (r = .54 for number of major
issues written on, r = .64 for number of words on major issues,
and r = .89 for the sum of scores on major issues assigned by
the graders in the present study). Since these three variables have 
substantial intercorrelations, however, the multiple correlation of 
these three variables with the global rating is only slightly higher 
than the zero order correlation for the sum of scores on major 
issues alone (multiple correlation = .90). 
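
Why correlated predictors add so little can be illustrated with the two-predictor formula for a multiple correlation. The sketch below is limited to two of the three major issue scores, since the correlation between number of major issues and number of words on major issues is not reported in the text; the input values are those read from the text and Table 5, and the function name is hypothetical.

```python
from math import sqrt


def multiple_r_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of a criterion with two predictors from zero-order r's."""
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return sqrt(r_squared)


# The global grade correlates .54 with the number of major issues written on and
# .89 with the sum of scores on major issues; these two predictors correlate .71.
print(round(multiple_r_two_predictors(0.54, 0.89, 0.71), 4))
```

With these inputs the result is only about one hundredth above the zero order correlation of .89, in line with the observation that the multiple correlation is only slightly higher.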

TABLE 5
Correlations among Student and Paper Characteristics*

                                       Ps’      Major      Minor
                                       Mean     Issue Sum  Issue Sum   Law
                                       Grade    of Scores  of Scores   GPA     LSAT
 1. Ps’ Mean Grade                      —        .89        .07        .62     .45
 2. Major Issues (number)              .54       .71       -.05        .40     .21
 3. Major Issues (number of words)     .64       .70        .07        .85     .19
 4. Major Issues (sum of scores)       .89        —         .04        .60     .38
 5. Minor Issues (number)             -.04      -.05        .89       -.11    -.07
 6. Minor Issues (number of words)    -.19      -.18        .75    [illegible] -.04
 7. Minor Issues (sum of scores)       .07       .04         —        -.06     .10
 8. Legal Jargon                       .66       .63        .04        .48     .27
 9. Case Citation (number)             .23       .09       -.17        .12    -.09
10. Case Citation (quality)            .11       .03    [illegible]   -.05    -.18
11. Reaches Conclusions                .04       .01        .15        .00    -.02
12. Strength of Argument               .36       .81        .07    [illegible] .08
13. Argues Both Sides                  .38       .32       -.14        .16     .18
14. Mean Sentence Length               .05       .00        .09       -.03     .09
15. Syllables/100 Words               -.20      -.15        .02       -.29     .07
16. Prepositions/100 Words            -.01      -.13        .05        .12     .27
17. Conjunctions/100 Words             .11       .10       -.04        .13    -.09
18. Transitional Phrases               .48       .34       -.01        .21     .23
19. Length                             .56       .61        .35        .32     .15
20. Spelling Errors                   -.10      -.01        .01       -.06    -.17
21. Grammatical Errors                -.22      -.18        .07       -.10    -.17
22. Punctuation Errors                -.14       .01        .05       -.09    -.06
23. Construction Errors               -.35      -.29        .00       -.05    -.07
24. Total Composition Errors          -.25      -.11        .04    [illegible] -.17
25. Handwriting                        .48       .21        .14    [illegible] .27
26. Law GPA                            .62       .60       -.08         —      .40
27. LSAT                               .45       .38        .10        .40      —

* N = 79 for all correlations except those involving LSAT where N = 74.


The sum of the scores on the minor issues had a near zero
correlation (r = .07) with the global essay grade. The number
of minor issues written on and the number of words written on 
minor issues have small negative correlations with the global 
rating. The results support the distinction made between major 
and minor issues. They also clarify the earlier finding of Klein 
and Hart (1968) that number of words was substantially cor- 
related with the global essay grade (r = .56). It is clear from 
the results in Table 5 that it is not simply the number of words 
written that is important, but the number of words written on 
major issues. 

The “use of legal jargon” was the only one of the three 
variables labeled specifics of legal training that had a substantial 
correlation (.66) with the global essay rating. The two citation 
variables had correlations of only .23 and .11 with the essay 
grade. 

The only “readability” of style variable that was found to 
be substantially correlated with the essay grade was the 
student’s use of transitional phrases (r = .48). 

Several of the composition characteristics of the essays had 
moderate correlations with the average global grade. The ones 
that had significant correlations were as follows: handwriting 
quality (.48), number of grammatical errors (-.22), and number 
of construction errors (-.35). 

Two characteristics of the type of argument used had 
significant correlations with the essay’s overall grade. These 
characteristics were “strength of argument for conclusions reached” 
(.36) and the “tendency to argue both sides of the issue” (.38). 
The fact that these characteristics had essentially zero 
correlations with the number of major issues discussed may explain 
Klein and Hart’s (1968) earlier finding that general intellectual 
ability (as measured by the LSAT) and essay length (number 
of words written) were unrelated to each other, but that their 
combination had a very high correlation with overall grade 
assigned to the essay. In other words, a student may be getting high 
grades on his essay by having (or acquiring) the ability to 
identify the major issues, and then limiting his discussion to them 
(as indicated by the high correlation between length and grades 
for the major issues only) and presenting that discussion by 
pushing his position strongly while arguing both sides of the 
major issues. Further, it is clear that a student is likely to write 
more words if he discusses both rather than just one side of a 
major issue. Additional support for this line of reasoning is that 
the multiple correlation between essay grades and the combination 
of arguing both sides of the question and the number of major 
issues discussed is .64, which is essentially the same as that 
reported by Klein and Hart for the combination of total essay 
length and LSAT scores (R = .68). 


Conclusions and Comments 


The results of this study have led to the following conclusions: 
(1) The interrater agreement was generally better for issues 
judged to be of major importance to the essay question than for 
issues judged to be of minor importance; however, the overall 
agreement was high for both types (.90 and .83 for major and 
minor issues, respectively). (2) Students appeared to get better 
overall grades on their essays if they did the following things: 
(a) identified the major issues and limited their answers to 
them; (b) presented their arguments in an orderly manner (as 
indicated by their use of transitional phrases); (c) pushed for a 
particular conclusion strongly while arguing both sides of each 
issue; (d) used legal jargon; and (e) wrote neatly and did not 
make composition errors. 

Given the knowledge that these factors are important, it would 
seem that a student could readily improve his essay grades 
by paying attention to some of them. For example, he could use 
more legal jargon, argue both sides of each issue at length, and 
write neatly. Other factors, however, appear to be a function 
of more basic skills. Thus, measures of them may be incorporated 
into test batteries designed to predict success in law school. For 
example, an examinee could be presented with a situation and a 
list of issues related to it. His task would be to identify which 
issues were of major vs. minor importance. A second kind of 
measure, and one that would tap the student’s skills in 
presenting an argument, might involve ordering a set of ideas in a 
logical sequence and/or selecting the stronger of two or more 
arguments. 


Appendix A 


(Note: FOR THE PURPOSES OF THIS QUESTION, DO NOT 
DISCUSS THE DAMAGE ISSUES RAISED BY THE FACTS.) 


Landlover owned five houses on a street in Milltown. He leased 
these to five different families at a monthly rental of $225.00 each. 


On May 1, Landlover called P, a reputable local house painter, 
and asked him if he would paint the houses. P agreed to do it for 
$500.00 per house if Landlover would reimburse him separately for 
the paint at the retail price of $6.00 per gallon. It was estimated 
that 100 gallons of paint would be required to complete the entire 
job. (Actually, P obtained the paint at the wholesale price of $4.00 
per gallon.) During the negotiations, Landlover had selected an 
off-shade of green which was obtained only by a careful mixing of 
several colors of paint in exact proportions, and Landlover had told 
P that it was important to him that all of the houses be of the same 
shade and P had assured him that this would be done. P also agreed 
to complete all of the houses by August 1. 

On May 15, P completed the first house. He thereupon sent a bill 
to Landlover for $620.00 which was for 20 gallons of paint, at $6.00 
per gallon ($120.00) and labor at $500.00. Landlover told him that 
he would pay nothing until all of the houses were completed, but 
upon P's insistence, he gave P $120.00 which represented the cost of 
the paint for the first house. 

On June 1, P completed the second house but did not request any 
payment. On June 15, he completed the third house and asked 
Landlover for $240.00 which was paid. He then informed Landlover 
that he would not complete the work unless Landlover agreed to 
pay him at the rate of $600.00 per house instead of the $500.00 
agreed upon. He said that he was losing money and that he would 
only come out even if he got that amount for each of the five houses. 
Landlover agreed after calling several other painters who told him 
that they would charge $750.00 each for the two remaining houses 
and that they would not guarantee that they could match the color 
that was used on the first three. 

The fourth house was finished by P on July 1 at which time P 
was paid another $120.00. On July 10, Landlover visited his prop- 
erty and was surprised to see a crew of men working on the fifth 
house, whereas P had painted the other four by himself. P was not 
present and he asked one of the men why they were doing the 
painting. He was informed that P had decided to go to Resortown 
for a vacation and hired the XYZ Firm, a local company which 
painted houses, to complete the fifth house. Landlover discovered 
that P was paying $500.00 to XYZ for the job. He also noticed that 
the paint was of a slightly different shade, but when he brought this 
to the attention of the men they told him that P had mixed it before 
he left and since they had completed half of the house, they were 
not going to do it over. 

Landlover told the men to stop painting and to leave his property. 
He then contacted P in Resortown who told him that he had no 
intention of coming back to complete the work but he assured 
Landlover that the job would be well done. “If it isn’t to your 
satisfaction,” he said, “I’ll do it over when I get back in August.” Landlover 
told him that he was not satisfied with that as he wanted the work 
done before August 1 and that he already knew that it wouldn’t 
be right as the shade of paint was different. P then told him that 
he knew that the shade was a little off but explained that he could 
do nothing about that as the paint manufacturer had discontinued 
one of the colors he had been using to mix the paint. 

Landlover told him that he was going to get someone else to do 
the last house if P didn’t come back by the 15th of July. P said 
that he had no intention of coming back by that date just to finish 
Landlover's fifth house. 

On July 12, Landlover hired T to finish the house, agreeing to pay 
him $1000 for the job. T completed it on the 1st of August, having 
painted over all the work done by the XYZ Firm. Landlover paid 
T although he complained that the color was a little darker than the 
other houses. 

DISCUSS FULLY THE LEGAL ISSUES (OTHER THAN 
DAMAGE ISSUES) BETWEEN LANDLOVER AND P. 


Appendix B 


Instructions for Graders 


1. The answers to the original question were typed. Fifteen papers 
   were then analyzed to determine what “issues” the student 
   actually discussed. A list of issues was then compiled and the 
   remaining papers were analyzed and broken down under these 
   issues. The papers were then cut up in accord with the analysis. 

2. You have the responses of 40 of the original 80 papers for each 
   issue. In other words, after the statement of the issue, you have, 
   at most, 40 statements each of which contains all that one student 
   said about that issue. In most cases there are less than 40 papers 
   as no student discussed all issues! 

3. Where a student discussed an issue in two or more places in his 
   answer, each part of the discussion is included. The presence of 
   three dots (...) between paragraphs indicates that his discussion 
   of that issue was not continuous. 

4. In some cases, the same part of a student’s paper appears under 
   two or more issues. This was necessary to preserve continuity, 
   and to avoid segmenting an answer so much that it became 
   unintelligible. 

5. Assume, as should be the case, that everything that the student 
   said about the issue appears under that issue. 

6. You should grade these answers in the context of the question. 
   In other words, you should try to determine how his response to 
   this issue would affect his grade on this question. 

7. Use a scale of 1 through 5. THE BETTER THE PAPER THE 
   HIGHER THE SCORE. The best papers should receive a score 
   of 5; the worst papers a score of 1. 

8. If you would deduct credit for what the student has written, 
   either because of the quality of an answer, or because you feel 
   the issue itself irrelevant, or for any other reason, assign a grade 
   of 1 and also place a D (for deduct) next to the score. 

9. If there are 8 or more statements on a given issue, please use all 
   5 grades at least once. 

NOTE ON TESTS CONCERNING THE G INDEX 
OF AGREEMENT 


G. A. LIENERT¹ 
University of Düsseldorf, Germany 


The G index, suggested by Holley and Guilford (1964), has 
proved useful in many respects, especially in correlating persons 
for Q-factor analysis. Though it is algebraically related to the 
well-known phi coefficient, its meaning is very different. It might 
well be called a homonymity-heteronymity index, since, by the 
formula 

$$G = \frac{(a + d) - (b + c)}{N},$$

it is the difference, relative to N, between the frequencies of the 
homonymly assigned cells (++ and --) and the heteronymly 
assigned cells (+- and -+) in a four-fold contingency table. This 
name seems to be more descriptive than the name index of 
agreement, which restricts the index to the cases of agreement 
between placements. Whether or not such an index is more adequate 
in describing the relationship between two characteristics in N 
individuals or between two persons on n items than is the phi 
or the tetrachoric correlation coefficient (cf. Cliff, 1962; Holley, 
1964) is a matter of methodological consideration. Within this 
consideration, the formal question arises how to decide whether 
or not an observed G index is significantly different from an 
expectation of zero under the null hypothesis. 


Testing the G Index Statistically 

Of course, G may not be tested in the same way as a phi 
coefficient, neither by the Fisher-Yates exact test nor by the 
asymptotic χ² test, because it is based not on the difference of the 
cross products ad − bc but on the difference of the cross sums 
(a + d) − (b + c). The null hypothesis is not that αδ − βγ = 0 but 
that (α + δ) − (β + γ) = 0, where α, β, γ, δ are the proportions in 
the four-fold population. It may easily be shown that G = 0 might 
be associated with a significant Fisher-Yates test (Bradley, 1968, 
p. 195), e.g., when a = 5, b = 9, c = 0 and d = 4 (P = 0,041). 
Consequently, the test applied is not adequately chosen. The 
question arises: how is G to be tested adequately? 

1 The author is indebted to Dr. H. Klinger, Dept. of Statistics of the 
University of Düsseldorf, for reading the manuscript and making many 
helpful remarks and suggestions. 

The answer is as follows: Under the null hypothesis, x = (a + d) 
and N − x = (b + c) are distributed binomially with the 
parameters N and π, where π = 1/2, i.e., according to the sign test 
of Dixon and Mood (1964). So x and N − x should not be too 
different if H0 is true, but x and N − x are expected to differ if H0 
is not true, i.e., π ≠ 1/2 according to a two-sided alternative 
hypothesis, or π > 1/2 according to a one-sided H1. Of course, a 
one-sided H1 should be stated only if the experimenter is able to 
make it appear reasonable. 

If N is large (> 30), under H0 the statistic x is distributed 
approximately normally with mean Nπ = N/2 and variance 
Nπ(1 − π) = N/4, such that the obtained deviation 

$$u = \frac{x - N/2}{\sqrt{N/4}}$$

is distributed with zero mean and unit variance. If N is 
between 20 and 30 the correction for continuity should be 
employed by subtracting 1/2 from the absolute value of the 
numerator above. 
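
The procedure just described can be sketched in a few lines. This is a minimal illustration, not the author's own computation: the function names are hypothetical, and the frequencies (a = 17, b = 29, c = 21, d = 19) are those of the worked example given later in this note.

```python
from math import comb, copysign, sqrt


def g_index(a, b, c, d):
    """G = ((a + d) - (b + c)) / N for a four-fold table."""
    n = a + b + c + d
    return ((a + d) - (b + c)) / n


def sign_test_lower_tail(x, n):
    """Exact one-sided P(X <= x) for X distributed Binomial(n, 1/2)."""
    return sum(comb(n, k) for k in range(x + 1)) / 2**n


def normal_deviate(x, n, continuity=False):
    """Normal approximation u = (x - N/2) / sqrt(N/4).

    With continuity=True, 1/2 is subtracted from the absolute value
    of the numerator (never pushing it past zero).
    """
    num = x - n / 2
    if continuity:
        num = copysign(max(abs(num) - 0.5, 0.0), num)
    return num / sqrt(n / 4)


a, b, c, d = 17, 29, 21, 19
x, n = a + d, a + b + c + d          # x = 36 agreements out of N = 86
print(round(g_index(a, b, c, d), 3))
print(round(normal_deviate(x, n), 2))
print(round(sign_test_lower_tail(x, n), 3))
```

The exact tail probability and the normal deviate answer the same one-sided question; for a two-sided alternative the tail probability would be doubled.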

While the sign test is an exact one the u-test is an asymptotic 
test. Tables for both tests (Bradley, 1968, Tables VII and VIII) 
give one-sided probabilities, which should be doubled if the al- 
ternative hypothesis is two-sided. 
The conditions for applying the sign test or its normal 
approximation are fulfilled if (a) there are only two outcomes, 
homonymous observations (++ or --) and heteronymous 
observations (+- or -+), and no other observation is possible; 
(b) the outcomes are mutually exclusive, as are homonymous 
and heteronymous observations, such that the probabilities of 
occurrence add to one; (c) the outcomes are completely 
independent, which is true if the outcomes are pairs of 
characteristics observed on N different individuals or objects, but 
which is questionable if the outcomes are pairs of persons judged 
as to n characteristics by only one judge and not, as desired, by n 
different and equally competent judges; (d) the outcomes are a 
random sample from some large, well-defined population of 
outcomes, such that under H0 every outcome has an equal chance 
of being sampled, and the chance of an outcome being 
represented twice in the sample equals zero. Condition (d) is 
relevant only if pairs of characteristics are observed in a sample of 
N persons, not if pairs of persons are judged as to n 
characteristics which are supposed to represent the total 
population of characteristics of interest to the experimenter. 


Testing the Difference between Two G Indices 


If two G indices G1 and G2 come, as statistics, from two 
independent analogous four-fold tables, the question arises whether 
or not they may be supposed to stem from the same four-fold 
population. In other words, the question arises whether the 
proportion p1 = x1/N1 is equal to the proportion p2 = x2/N2 under H0. 

An exact test for comparing two proportions under H0: p1 = p2 
= π (so that the common G is not necessarily equal to zero) is the 
Fisher-Yates test applied to the following four-fold table: 

    a1 + d1     b1 + c1     N1 
    a2 + d2     b2 + c2     N2 

    a + d       b + c       N 

An asymptotic test of the same null hypothesis is the χ² test, which 
is to be preferred if N = N1 + N2 is large and if none of the four 
frequencies is below five. Since χ² with one degree of freedom is 
identical with the square of the normal deviate u, we may test 
whether or not the absolute value of 

$$u = \frac{(a_1 + d_1)(b_2 + c_2) - (b_1 + c_1)(a_2 + d_2)}{\sqrt{N_1 N_2 (a + d)(b + c)/N}}$$

is equal to or greater than its critical value, u.05 = 1,96 and 
u.01 = 2,58 for a two-sided H1, or u.05 = 1,65 and u.01 = 2,33 for a 
one-sided H1. If N1, N2 > 15 and/or one of the four frequencies is 
smaller than three the correction for continuity is suggested: 
Subtract N/2 from the absolute value of the numerator of the 
formula above! 
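
As a hedged sketch of the asymptotic comparison, the normal deviate can be computed directly from the agreement counts x1 and x2 and the table totals N1 and N2. The function name is hypothetical; the input values (x1 = 36, N1 = 86, x2 = 36, N2 = 63) are those of the example that follows in this note.

```python
from math import sqrt


def two_g_deviate(x1, n1, x2, n2):
    """Normal deviate for H0: p1 = p2, where p_i = x_i / N_i.

    x_i is the number of homonymous (agreeing) observations a_i + d_i,
    so N_i - x_i is the number of heteronymous observations b_i + c_i.
    """
    x = x1 + x2                      # pooled agreements, a + d
    n = n1 + n2                      # pooled sample size, N
    numerator = x1 * (n2 - x2) - (n1 - x1) * x2
    denominator = sqrt(n1 * n2 * x * (n - x) / n)
    return numerator / denominator


u = two_g_deviate(36, 86, 36, 63)
print(round(u, 2))  # -1.84
```

The square of this deviate is the ordinary four-fold χ² with one degree of freedom, so either form may be referred to the corresponding table.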


Testing Homogeneity of Several G Indices 


If the experimenter is interested in testing whether or not 
more than two, say k, G indices are independent sampling 
statistics from an identical population of G indices with unknown 
π, he may establish a 2 × k contingency table of the following 
type 

    x1 = (a1 + d1)     N1 − x1 = (b1 + c1)     N1 
      ...                 ...                   ... 
    xk = (ak + dk)     Nk − xk = (bk + ck)     Nk 

and test the null hypothesis p1 = x1/N1 = ... = pk = xk/Nk = π 
against the omnibus alternative hypothesis that p differs from π 
for at least one p. The commonly used asymptotic test is the k × 2 
χ² contingency test, by means of the so-called Brandt-Snedecor 
formula. 

For the special case of a 3 × 2 contingency table with equal row 
sums, N1 = N2 = N3, there is an exact test, tabulated by Bennett 
and Nakamura (1963). Multiple comparisons of each of two G 
indices implicit in a k × 2 contingency table may be performed by 
suitable methods for partitioning χ² (Maxwell, 1961, pp. 52-56). 
The same is true for trend comparisons, i.e., for testing 
p1 = ... = pk against p1 < ... < pk or p1 > ... > pk (Maxwell, 
1961, pp. 63-69). 

An Example 


The hypothesis that negative agreement between risk taking 
tendency (R) and need for achievement (A) is partially deter- 
mining accident-proneness, is tested in the following way: 

A sample of N = 86 steel-plant division-men, having had at 
least one work accident, was examined as to R and as to A 
dichotomously, giving the following four-fold table: 


                A 
              +    - 

    R    +   17   29 
         -   21   19 

The hypothesis above seems to be true, since more accident workers 
disagree with respect to the two characteristics (29 + 21 = 50) than 
agree (17 + 19 = 36), resulting in G = (36 − 50)/86 = −0,16. 

The critical question is whether or not the observed G of −0,16 
is significantly different from zero agreement, if a significance level 
of 5 per cent is required for the one-sided test of μG = 0 (H0) against 
μG < 0 (H1). 

Applying the asymptotic normal sign test we get x = 17 + 19 = 36 
and N − x = 50, such that u = (36 − 86/2)/√(86/4) = −7/4,64 = 
−1,51 is numerically smaller than the critical u.05 = 1,65. So 
H0 (no agreement) may not be rejected. 

The experimenter may ask another question, namely, whether the 
agreement between R and A is algebraically smaller in the N1 = 86 
accident workers than it is in a sample of N2 = 63 nonaccident 
workers. More precisely, H0 (μG1 = μG2 = μG) is tested against 
H1 (μG1 < μG2) one-sided. Suppose the four-fold table of the 
nonaccident workers is found to have the frequencies a = 18, b = 11, 
c = 16 and d = 18. Then x1 = 36 and N1 − x1 = 50, while 
x2 = 18 + 18 = 36 and N2 − x2 = 11 + 16 = 27, so that G1 = −0,16 
in accident workers versus G2 = (36 − 27)/(36 + 27) = +0,14 in 
nonaccident workers. Testing the difference G1 − G2 we arrange the 
following four-fold table 

    x1 = 36     N1 − x1 = 50     N1 = 86 
    x2 = 36     N2 − x2 = 27     N2 = 63 

    x  = 72     N − x   = 77     N  = 149 

Since N = N1 + N2 = 149 and the smallest of the four frequencies, 
27, is large enough, H0 is tested asymptotically by the four-fold 
χ² = 149(36·27 − 50·36)²/(86·63·72·77), that is, by evaluating the 
normal deviate u = (36·27 − 50·36)/√(86·63·72·77/149) = 
−828/449 = −1,84. Since |u| > 1,65 = u.05 we accept H1 instead 
of H0: The agreement between R and A is algebraically lower in 
accident workers (G1 = −0,16) than it is in nonaccident workers 
(G2 = +0,14). Thus we may predict accident proneness if a worker 
scores high in R and low in A or vice versa. 


Remark 


The G index of agreement is more appropriate in this case 
than is a phi coefficient, since the variables A and R have been 
dichotomized from non-normally distributed personality 
questionnaire scores (Ehlers, 1965, Table 1). G had been used as a 
measure of association instead of a phi coefficient from variables 
dichotomized at the median because several scores were tied 
with either Md1 or Md2. 


Discussion 
Testing G against zero agreement is not equivalent to testing 
phi against zero association, since G as well as the G test rely 
on the cross sums of the four-fold frequencies, unlike phi and the 
phi test, which rely on the particular four-fold frequencies. Thus 
the G test involves less detailed information than the phi test, 
and it might be concluded that the G test is, other things being 
equal, less efficient in detecting association than is the phi test. 
Fortunately, this argument does not hold, since G and phi are 
based on different null hypotheses and therefore are sensitive 
to different types of association, such that direct comparison of 
the two tests in the sense of relative efficiency (Bradley, 1968, p. 
57) is inadmissible. This is true for the exact as well as for the 
asymptotic tests. 

While the difference between two Gs may be tested exactly 
under H0, following the well-known hypergeometric 
distribution, the difference between two phis may not be tested exactly, 
since the null distribution depends on the specific phis and must 
be evaluated by Fisher’s method of randomization applied to 
differences of phis gained by sampling of pairs of four-fold tables 
under H0 (no difference in association as represented by phis). 
As to the procedure, the reader is referred to Bradley (1968, p. 
68). Thus, differences in agreement may be tested exactly and 
quickly by tables of the hypergeometric distribution or by using 
recursion formulas (cf. Feldman and Klinger, 1963), while 
differences in association may, practically speaking, not be tested 
exactly. (Asymptotically, the difference between two phis may 
be tested like the difference between two r's using Fisher's 
z-transformation.) 

In deciding whether to test asymptotically or exactly for 
differences in Gs, follow the general suggestions given by Sachs 
(1968, p. 343). If (a) all expected frequencies of the four-fold 
table are larger than three and if (b) the total of both samples 
N = N1 + N2 is larger than 20, test asymptotically! If at least 
one of the two conditions is lacking, test exactly by the table of 
Finney et al. (1963). If the latter table is not available, test 
asymptotically by applying correction for continuity or, less 
conservatively, by applying Woolf’s (1957) likelihood ratio test 
to the cell frequencies corrected for continuity (see Sachs, 1968, 
p. 346). 

Remark: Testing the difference between two Gs either exactly (N1 
+ N2 < 20) or asymptotically (N1, N2 > 20) must not be confused 
with testing for differences between two four-fold tables by an 
omnibus test devised by Le Roy (1962). This test is sensitive to 
differences of all sorts between two independent four-fold tables, i.e., 
even to those differences which do not lead to different Gs (or phis). 
Consider, for example, a table with a1 = 10, b1 = 30, c1 = 0 and 
d1 = 10 and another table with a2 = 10, b2 = 0, c2 = 30 and d2 = 
10. Though both tables give identical Gs (and phis), Le Roy’s test 
proves to be significant because of the large differences b1 − b2 and 
c1 − c2. 
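
That the two tables in the remark share the same G and the same phi can be checked with a short sketch. The function names are hypothetical; the formulas are the G index defined at the start of this note and the usual four-fold phi coefficient.

```python
from math import sqrt


def g_index(a, b, c, d):
    """G = ((a + d) - (b + c)) / N."""
    return ((a + d) - (b + c)) / (a + b + c + d)


def phi_coefficient(a, b, c, d):
    """phi = (ad - bc) / sqrt of the product of the four marginal sums."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))


table_1 = (10, 30, 0, 10)
table_2 = (10, 0, 30, 10)

for table in (table_1, table_2):
    print(g_index(*table), phi_coefficient(*table))
# Both lines print -0.2 0.25
```

The two tables differ only in which off-diagonal cell carries the 30 disagreements, a difference to which neither G nor phi is sensitive.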


Summary 


From an inferential point of view the G index of agreement 
suggested by Holley and Guilford (1964) has many advantages 
compared with the traditional phi coefficient of association if 
both measures of association are, from the descriptive point of 
view, equally well justified: 

1. An observed G may be tested against zero G expectation 
exactly by the well-known sign test and asymptotically by the 
normal approximation to the binomial with π = 1/2. 

2. The difference between two observed Gs may be tested 
exactly by the Fisher-Yates test and asymptotically by the χ² 
four-fold contingency test. 

The inferential procedures (1) and (2) were illustrated by an 
example from personality research and some additional 
considerations relevant to applications were discussed. 
A GENERALIZED UPPER-LOWER ITEM 
DISCRIMINATION INDEX¹ 


ROBERT L. BRENNAN 
State University of New York at Stony Brook 


A discrimination index is one of the measures of item 
effectiveness typically calculated by test evaluators. Although a 
discrimination index, in the classical sense, is described as a measure 
of item-criterion correlation, such an index is frequently 
interpreted as a measure of comparison between the number of students 
in an upper group who get an item correct and the number of 
students in a lower group who get the item correct. 

This paper is primarily concerned with the upper-lower types 
of indices which have been discussed by writers such as 
Bridgman (1964), Ebel (1954a, b), Engelhart (1965), Findley (1956), 
Johnson (1951), and Long, Sandiford, et al. (1935). The first 
section discusses a rationale for such indices and the relationship 
between this rationale and the discrimination index D. In the 
second section a new upper-lower discrimination index, called B, 
is developed, and in the third section the exact distribution of 
this index is determined under the null hypothesis B = 0. The 
final section treats some problems in using and interpreting 
discrimination indices, especially in the criterion-referenced testing 
situation. 

The Discrimination Index D 


Given the large number of proposed item discrimination indices, 
it is often difficult to evaluate the effectiveness or usefulness of 
the various measures. Findley (1956) has proposed a rationale 
for the evaluation of such indices that has at least three strong 
points recommending it: (a) Findley’s rationale corresponds 
directly with an intuitive notion of discrimination, (b) it is easily 
interpretable, and (c) it is distribution free. 

1 The research reported here was benefited by the support of the United 
States Naval Academy, USNA Contract No. N00161-70-C-0119, and the Office 
of Naval Research, ONR Contract No. N00014-67-A-0298-0032. See Brennan 
(1970) for a more detailed report of this research. 


Rationale for Discrimination Indices 

Consider the following logic for discrimination indices. Suppose 
we have 10 students in an upper group and 10 students in a lower 
group for some test. The ideal item (according to Findley), in 
the sense of discriminating power, would be one for which all 10 
students in the upper group get the item correct and all 10 students 
in the lower group get it wrong. That is, the item would distinguish 
each student in the upper group from each student in the lower 
group; consequently, the item would be making 10 × 10 = 10² 
= 100 correct discriminations. 

Now, consider the case in which eight out of 10 students in 
the upper group get the item correct and three out of 10 students 
in the lower group get the item correct. In this case, the item 
makes correct discriminations between the eight in the upper 
group who get the item correct and the seven in the lower group 
who get the item wrong; i.e., the item makes 8 × 7 = 56 
correct discriminations. By a similar line of reasoning, the item 
makes 2 × 3 = 6 incorrect discriminations, and (8 × 3) + (2 × 7) 
= 38 neutral discriminations. Therefore, the net amount of 
effective discriminations is 56 − 6 = 50, which is 50 per cent of the 
100 possible correct discriminations; i.e., the discrimination index, 
D, for this item, according to the above rationale, is 0.50. 
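
The counting argument above can be checked by brute force over every upper-lower pair of students. The sketch below is only an illustration; the function name and the 0/1 coding of item responses are assumptions.

```python
def count_discriminations(upper, lower):
    """Classify every (upper, lower) pair of item responses.

    upper, lower: lists of 1 (item correct) or 0 (item wrong).
    Returns (correct, incorrect, neutral) pair counts.
    """
    correct = sum(1 for u in upper for l in lower if u == 1 and l == 0)
    incorrect = sum(1 for u in upper for l in lower if u == 0 and l == 1)
    neutral = len(upper) * len(lower) - correct - incorrect
    return correct, incorrect, neutral


upper = [1] * 8 + [0] * 2    # eight of 10 upper-group students correct
lower = [1] * 3 + [0] * 7    # three of 10 lower-group students correct

correct, incorrect, neutral = count_discriminations(upper, lower)
D = (correct - incorrect) / (len(upper) * len(lower))
print(correct, incorrect, neutral, D)  # 56 6 38 0.5
```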


Derivation of D 


Recalling the above example, it is also true that 8 − 3 = 5 is 
50 per cent of the maximum possible value of this difference, 10. 
It can be shown that the relation between the latter and the 
former calculations is completely general: 

Let D = upper-lower discrimination index, 
    U = number in upper group who get the item correct, 
    L = number in lower group who get the item correct, 
and n = total number in each group; the total n in each group 
must be the same. 

This means that 
    U(n − L) = number of correct discriminations, 
    L(n − U) = number of incorrect discriminations, 
and n² = maximum number of correct discriminations. 
Then 

\begin{align}
D &= \frac{U(n - L) - L(n - U)}{n^2} \notag \\
  &= \frac{U - L}{n} \tag{1} \\
  &= \frac{U}{n} - \frac{L}{n}. \tag{2}
\end{align}

The index D was recommended by Johnson (1951), prior to 
Findley’s (1956) article concerning the previously described 
rationale for such an index. 


The Discrimination Index B 


The basic rationale behind the index D seems to be quite 
reasonable and useful; however, the necessity for using equal 
numbers of observations in the upper and lower groups seems 
overly restrictive and undesirable in many situations. 

The use of equal ns is largely a result of the popularity of 
cut-offs such as the median and the upper and lower twenty-seven 
per cent. These symmetric cut-off points are, in turn, basically 
a result of the preoccupation of test theory with the normal 
distribution, which is, of course, symmetric.² Unfortunately, 
however, not all reasonable distributions of test scores are normal. 
Consequently, the use of symmetric cut-offs and equal ns in upper 
and lower groups is not necessarily justifiable, either from a 
mathematical or from a practical point of view. For example, 
in the case of mastery tests, many teacher-made tests, and 
criterion-referenced tests (see, for example, Glaser and Klaus, 
1963, and Popham and Husek, 1969), the test constructor often 
expects most of the students to get most of the items correct, 
yielding a distribution of test scores that is negatively skewed. 
In these circumstances, the requirement of equal ns in the upper 
and lower groups seems artificial, at best. 

2 Kelley (1939) notes that the upper and lower 27 per cent of the cases 
constitute optimal groups for determining discrimination indices only when 
the criterion test scores are normally distributed. 

Furthermore, regardless of the shape of the distribution of test 
scores, it seems reasonable to allow the test evaluator the freedom 
to choose the cut-off points between the upper and lower groups. 
Only he can determine the cut-off points that yield meaningful 
and interpretable upper and lower groups based upon his con- 
sideration of the test content, student population, and overall 
expectations for student performance on the test. When the test 
constructor is free to choose the cut-off points, there is, clearly, 
no reason to expect that the resulting groups will be of equal 
size. 

Thus, in effect, we would like a discrimination index that is 
similar to D but that allows for the use of unequal ns in the 
upper and lower groups, thereby giving the evaluator the freedom 
to choose appropriate cut-off points between these groups. By a
line of reasoning similar to that of Findley (1956) and Johnson 
(1951) such an index can be constructed. 


Let B = the index under consideration,
U = the number of students in the upper group who get
the item correct,
L = the number of students in the lower group who get
the item correct,
n₁ = the total number of students in the upper group,
and n₂ = the total number of students in the lower group.
This means that
U(n₂ − L) = the number of correct discriminations,
L(n₁ − U) = the number of incorrect discriminations,
and n₁n₂ = the total number of possible discriminations.
Therefore,
\begin{align}
B &= \frac{U(n_2 - L) - L(n_1 - U)}{n_1 n_2} \notag \\
  &= \frac{U n_2 - L n_1}{n_1 n_2} \tag{3} \\
  &= \frac{U}{n_1} - \frac{L}{n_2} \tag{4}
\end{align}


Thus, B turns out to be different from D only in that B is the
difference between two proportions based on unequal ns. However,
a B of 0.30 is interpreted verbally in the same way as a D
of 0.30; namely, 30 per cent more students in the upper group got
the item correct than in the lower group. B is, of course,
equivalent to D when n₁ = n₂.
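
As a minimal sketch of Equation 4, the index can be computed directly from the four counts. The function name and the example counts below are hypothetical and are not taken from the article; exact fractions are used only to avoid rounding.

```python
from fractions import Fraction


def discrimination_index_b(upper_correct, lower_correct, n_upper, n_lower):
    """Return B = U/n1 - L/n2, the difference between two proportions."""
    if not (0 <= upper_correct <= n_upper and 0 <= lower_correct <= n_lower):
        raise ValueError("correct counts must lie between 0 and the group size")
    return Fraction(upper_correct, n_upper) - Fraction(lower_correct, n_lower)


# Hypothetical item: 12 of 15 upper-group and 3 of 10 lower-group students correct.
print(float(discrimination_index_b(12, 3, 15, 10)))  # 0.5

# With equal group sizes the same function gives D = (U - L)/n.
print(discrimination_index_b(8, 2, 10, 10))  # 3/5
```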


Testing the Significance of B 

Since the index B is basically the difference between two proportions
based upon unequal sample sizes, the data can be considered
in the framework of a contingency table as in Table 1. It is well
known that H₀: B = U/n₁ − L/n₂ = 0 can be tested through the
use of the χ² or normal distribution when n₁ and n₂ are fairly large
and none of the cell entries in Table 1 is less than five (see Snedecor
and Cochran, 1967, p. 221). Unfortunately, however, these constraints
are not always realistic. Very often test evaluators must
work with small numbers of students for evaluating tests, especially 
tests that characterize self-instructional programs. Also, it is not 
uncommon for one of the cell entries in Table 1 to be less than five 
for several (or possibly many) items on a test, especially a mastery 
test or a criterion-referenced test. 

In such cases, Fisher’s Exact Test has been recommended as
an alternative to the χ² or normal test. However, there seem to
be at least two related problems associated with the use of
Fisher’s Exact Test in these situations:

(a) All marginals (not just n₁ and n₂) are assumed to be fixed.
Since (U + L)/(n₁ + n₂) is an estimate of the difficulty level
associated with a particular item, fixed marginals imply that
the significance level for the discrimination index B is very much
dependent upon the observed difficulty level of the item. Flannagan
(1939) has noted that it is desirable for a discrimination index to be
unaffected by the difficulty level associated with a particular item.
It likewise seems desirable for the test of significance of a given
discrimination index to be unaffected by the observed item difficulty
level (i.e., to be independent of the column marginals in Table 1).


TABLE 1
Contingency Table for B

              Correct   Wrong              Total
Upper Group   U         n₁ − U             n₁
Lower Group   L         n₂ − L             n₂
Total         U + L     n₁ + n₂ − U − L    n₁ + n₂

(b) Using Fisher’s Exact Test it is possible for some values
of B based upon given values of n₁ and n₂ to be classified as
either significant or nonsignificant depending upon the values of
U and L. For example, suppose that n₁ = 15 and n₂ = 10 in
Table 1. It is easily verified that there are six sets of values for
U and L that give the same discrimination index, B = 0.47;
two of these sets of values for U and L cause B to be classified
as nonsignificant while the other four sets of values cause B to be
classified as significant. Thus, it is quite possible for one item
with a given discrimination index to be declared significant and
for another item with the same discrimination index to be declared
nonsignificant, even though both items are evaluated on
the basis of the same test data.
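
The count of six sets can be checked by brute force. The sketch below assumes that 0.47 stands for the exact value 14/30 and that the six sets include both signs of B (three with B = +14/30 and three with B = −14/30); the function name is hypothetical. It only lists the sets and does not reproduce the significance classifications, which depend on how Fisher's Exact Test is applied.

```python
from fractions import Fraction


def score_pairs_for_abs_b(n_upper, n_lower, target):
    """List every (U, L, B) whose |U/n1 - L/n2| equals the target fraction."""
    pairs = []
    for u in range(n_upper + 1):
        for l in range(n_lower + 1):
            b = Fraction(u, n_upper) - Fraction(l, n_lower)
            if abs(b) == target:
                pairs.append((u, l, b))
    return pairs


# n1 = 15, n2 = 10, |B| = 14/30 (about 0.47)
for u, l, b in score_pairs_for_abs_b(15, 10, Fraction(14, 30)):
    print(u, l, round(float(b), 2))
```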

The difficulties described in (a) and (b) are primarily the re- 
sult of keeping all marginals fixed. Referring again to Table 1, 
suppose that we allow the column marginals to vary while keep- 
ing the row marginals fixed. Also, let us assume that the obser- 
vations are independent. (This latter constraint is not an ad- 
ditional assumption since independence is also assumed when 
using the χ² test and Fisher’s Exact Test.) Under these conditions,
the cell frequencies associated with both the upper and
lower groups are distributed according to the binomial distribution.
That is, if X is the binomial variable associated with the
upper group (row 1) and Y is the binomial variable associated
with the lower group (row 2), then

\begin{align}
P(X = U) &= \binom{n_1}{U} p_1^{U} (1 - p_1)^{n_1 - U}, \quad U = 0, 1, \ldots, n_1, \tag{5} \\
P(Y = L) &= \binom{n_2}{L} p_2^{L} (1 - p_2)^{n_2 - L}, \quad L = 0, 1, \ldots, n_2, \tag{6}
\end{align}
and
\begin{equation}
P[(X = U) \text{ and } (Y = L)] = \binom{n_1}{U} \binom{n_2}{L} p_1^{U} (1 - p_1)^{n_1 - U} p_2^{L} (1 - p_2)^{n_2 - L}, \tag{7}
\end{equation}
where p₁ is the population probability of success (correctness)
in the upper group, and p₂ is the population probability of success
(correctness) in the lower group.
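Since the same value for B can often be generated by several
sets of values for U and L,

\begin{equation}
P(B = k) = \sum P[(X = U) \text{ and } (Y = L)], \tag{8}
\end{equation}

where the sets of values (U, L) that generate any given B = k
are determined by all integral solutions for U and L of

\begin{equation}
n_2 U - n_1 L = k\, n_1 n_2, \quad U = 0, 1, \ldots, n_1, \quad L = 0, 1, \ldots, n_2. \tag{9}
\end{equation}

Under these assumptions, it is clear that

\begin{equation}
E(B) = E\left(\frac{X}{n_1} - \frac{Y}{n_2}\right) = p_1 - p_2 \tag{10}
\end{equation}
and
\begin{equation}
V(B) = \frac{p_1 (1 - p_1)}{n_1} + \frac{p_2 (1 - p_2)}{n_2}. \tag{11}
\end{equation}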

In order to test the null hypothesis B = 0, we must identify the
distribution of B when E(B) = 0 in the population. This occurs
when p₁ = p₂. There are, of course, an infinite number of possible
values for p₁ and p₂; however, it is the claim of this writer that it is
often reasonable to specify p₁ = p₂ = 0.50 because:

(a) The computations are simplified considerably. Equations
7, 10, and 11 become, respectively,

\begin{align}
P[(X = U) \text{ and } (Y = L)] &= \binom{n_1}{U} \binom{n_2}{L} (0.50)^{n_1 + n_2}, \tag{12} \\
E(B) &= 0, \tag{13}
\end{align}
and
\begin{equation}
V(B) = \frac{n_1 + n_2}{4 n_1 n_2}. \tag{14}
\end{equation}

(b) The distribution of B is somewhat easier to work with
since it is symmetric.

(c) The variance of B is maximized when p₁ = p₂ = 0.50. This
means that the test of the null hypothesis B = 0 will be conservative.


Using Equations 8, 9, and 12, it is relatively easy, although
laborious, to calculate P(B = k) for every possible value of k,
given n₁ and n₂. The resulting probability distribution is the exact
distribution of B under the null hypothesis B = 0 when p₁ = p₂ =
0.50. From this distribution one can determine critical values of B
for various levels of significance.
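
The calculation described by Equations 8, 9, and 12 can be sketched as follows. This is not the FORTRAN program of Brennan (1970); the function names are hypothetical, and the critical value is assumed to be the smallest attainable |B| whose two-tailed probability P(|B| ≥ k) does not exceed the chosen level. Under that reading the sketch gives .500 and .700 for n₁ = n₂ = 10.

```python
from fractions import Fraction
from math import comb


def exact_distribution_of_b(n1, n2, p=Fraction(1, 2)):
    """Exact P(B = k) for every attainable k when p1 = p2 = p."""
    dist = {}
    for u in range(n1 + 1):
        prob_u = comb(n1, u) * p**u * (1 - p) ** (n1 - u)
        for l in range(n2 + 1):
            prob_l = comb(n2, l) * p**l * (1 - p) ** (n2 - l)
            k = Fraction(u, n1) - Fraction(l, n2)
            # Sum over all (U, L) pairs that generate the same value k.
            dist[k] = dist.get(k, 0) + prob_u * prob_l
    return dist


def two_tailed_critical_value(n1, n2, alpha, p=Fraction(1, 2)):
    """Smallest attainable |B| with P(|B| >= k) <= alpha (assumed definition)."""
    dist = exact_distribution_of_b(n1, n2, p)
    for k in sorted({abs(b) for b in dist}):
        tail = sum(prob for b, prob in dist.items() if abs(b) >= k)
        if tail <= alpha:
            return k
    return None  # no attainable value is extreme enough at this alpha


print(float(two_tailed_critical_value(10, 10, Fraction(5, 100))))  # 0.5
print(float(two_tailed_critical_value(10, 10, Fraction(1, 100))))  # 0.7
```

Exact fractions keep the tail probabilities free of rounding error. Using |B| for both tails is natural only when the distribution is symmetric, as it is for p₁ = p₂ = 0.50.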

Although the distribution of B for p₁ = p₂ = 0.50 has several
desirable properties, other values for p₁ (and p₂) may be more
reasonable in certain circumstances. For example, if the distribution
of test scores is very negatively skewed, it may be more reasonable
to choose, say, p₁ = p₂ = 0.80 or p₁ = p₂ = the mean of the
distribution of test scores. It should be noted, however, that as p₁ and p₂
depart from 0.50, the critical value of B decreases for given n₁, n₂,
and α.

For sufficiently large values of n₁ and n₂, B is distributed
approximately normally, so that

\begin{equation}
z = \frac{B}{\sqrt{\dfrac{n_1 + n_2}{n_1 n_2}\, p (1 - p)}} \tag{15}
\end{equation}

is approximately a standard normal deviate, where p = p₁ = p₂. For
smaller values of n₁ and n₂, critical values for B should be determined
from the exact distribution.³ The Appendix gives the two-tailed
critical values of B for p < 0.05 and p < 0.01 when 10 ≤ n₁, n₂ ≤ 30
and p₁ = p₂ = 0.50.
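
A minimal sketch of the large-sample test follows. It assumes that B is referred to a normal distribution with mean zero and the variance of Equation 11 evaluated at p₁ = p₂ = p. The function name and the example values are hypothetical.

```python
from math import erfc, sqrt


def normal_approximation(b, n1, n2, p=0.5):
    """Return (z, two-tailed p-value) for B under H0 with p1 = p2 = p."""
    standard_error = sqrt(p * (1 - p) * (n1 + n2) / (n1 * n2))
    z = b / standard_error
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Hypothetical item evaluated on 60 upper-group and 40 lower-group students.
z, p_value = normal_approximation(0.25, 60, 40)
print(round(z, 3), round(p_value, 4))
```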


Some Considerations for Using Discrimination Indices 
Upper-Lower vs. Correlational Indices 


In general, upper-lower discrimination indices and correlational
discrimination indices are based upon different rationales: the
former reflect the number of discriminations made by an item,
while the latter are a measure of association between item
scores and scores on some criterion variable (usually total test
score). Clearly, only the test evaluator can determine which
rationale is more appropriate for a given test; however, in the
opinion of this author, the upper-lower indices are to be preferred
on the basis of calculating ease and interpretability. It is
much easier and quicker to determine the number of discriminations
made by all items on a test than it is to calculate a correlation
coefficient for every item. As far as interpretability is concerned,
it is simply not true that r = .60 is twice as large as
r = .30; however, B = .60 is, in a sense, twice as large as B =
.30. That is, B = .30 means that the difference between the
percentages of students getting the item correct in the upper and
lower groups is 0.30 in favor of the upper group, while B = .60
means that this difference is 2 × 0.30 = 0.60 in favor of the
upper group.

³ Brennan (1970) gives a FORTRAN computer program for calculating
several critical values of B for given n₁, n₂, p₁, and p₂. This program is
also capable of generating entire tables of critical values of B.

Part of the motivation for the development of a new upper- 
lower discrimination index originated from recently published 
literature concerning criterion-referenced tests. Whereas the in- 
terpretation of norm-referenced tests often relies upon an assumed 
normal distribution of test scores, such a constraint seems 
neither reasonable nor necessary for criterion-referenced tests, 
mastery tests, and even some teacher-made tests. As previously 
indicated, when the distribution of test scores is not normal, the 
index D does not seem to be as appropriate as B. Also, the in- 
dex B allows the evaluator considerably more freedom in defin- 
ing upper and lower groups. 

Some of the correlational type of discrimination indices are
also affected by non-normal test score distributions. For example,
lack of normality precludes the use of the tetrachoric correlation
coefficient, r_t. Also, unless one is willing to assume that
student responses to dichotomous items are essentially continuous
and normally distributed, the biserial correlation coefficient,
r_b, should not be used. Neither the point biserial correlation
coefficient, r_pb, nor the phi coefficient, r_φ, necessitates normality
assumptions; however, when testing the significance of r_φ one
encounters the previously discussed problem concerning the use
of the χ² distribution and Fisher’s Exact Test.

Recently Brennan (1970) compared B with D and r_pb for
six sets of synthetic data that generated a range of test score
distributional forms from a normal distribution to a very negatively
skewed distribution. For each set of synthetic data, the
discrimination indices B, D, and r_pb were calculated for every
item. Also, for each set of synthetic data, the test items were
rank ordered on the basis of each one of the discrimination
indices. The following are some of the results of this analysis:
(a) the number of significant items varied considerably with
the index chosen and the degree of skewness; (b) r_pb and D
consistently declared fewer items to be significant as the distribution
of scores became more negatively skewed; (c) the number of
significant items resulting from the use of B was very much
dependent upon the cut-off point chosen for differentiating between
the upper and lower groups; and (d) the rank orders of the items
were very much dependent upon the index used as a basis for
ordering. Perhaps none of these results is startling; however,
they do demonstrate that r_pb, D, and B should not be used
interchangeably when the distribution of test scores is not normal.


Interpretation of Discrimination Indices with 
Criterion-Referenced Tests 


In general terms, we can say that each item on a test can be 
classified into one of four mutually exclusive categories: 


(a) passed by the upper group; failed by the lower group; 

(b) failed by the upper group; passed by the lower group; 

(c) passed by the upper group; passed by the lower group; and 
(d) failed by the upper group; failed by the lower group. 


In terms of discrimination indices: (a) indicates a positively
discriminating item; (b) indicates a negatively discriminating
item; (c) indicates a nondiscriminating item with high difficulty
level; and (d) indicates a nondiscriminating item with low
difficulty level.

For norm-referenced tests, nondiscriminating items are unacceptable,
negatively discriminating items are usually unacceptable,
and positively discriminating items are acceptable. The
argument usually presented against nondiscriminating items is
that such items tell us nothing about differences among students,
a very reasonable argument for a test designed to identify
student differences. The argument usually presented against
negatively discriminating items is that such items must be poor
since the best students seem to get them wrong, while the poor
students get them right; i.e., such items are mis-identifying
student differences. Positively discriminating items are acceptable,
since they identify student differences in the expected (and
desired) direction.

For criterion-referenced tests, the interpretation of discrim- 
ination indices needs to be modified. According to Popham and 
Husek (1969), 


“An item which doesn’t discriminate need not be eliminated. 
... If it reflects an important attribute of the criterion, such 
an item should remain in the test. . . . 

A positively discriminating item is just as respectable in 
a criterion-referenced test as it is in a norm-referenced test, 
but certainly not more so. In fact, the positively discriminating
item may point to areas of instruction (if the criterion
measure is assessing the effects of instruction) where the pro- 
gram is not succeeding well enough. . . . 

However, negatively discriminating items are treated ex- 
actly the same way in a criterion-referenced approach as they 
are in a norm-referenced approach. . . . When one discovers 
a negative discriminator in his pool of criterion-referenced 
items, he should be suspicious of it and after more careful 
analysis can usually detect flaws in such an item (pp. 6-7).” 


It seems correct to say that a nondiscriminating item need not be 
eliminated if it reflects an important objective of the test, but this 
statement needs clarification. A nondiscriminating item occurs 
whenever the percentage of students in the upper group getting 
the item correct approximately equals the percentage of students 
in the lower group getting the item correct. If both these per- 
centages are high, then we have the best condition possible; i.e., 
most of the students get the item correct (high difficulty level), 
and the item seems to be equally effective for both the upper and 
lower groups. If, however, both percentages are low, then most 
of the students are not getting the item correct (low difficulty 
level). In this case, some revision is indicated simply on the 
basis of the item’s difficulty level. Thus, in order to determine
whether or not a nondiscriminating item needs revision, we
must take difficulty level into account. It, therefore, becomes
important that the test of significance for a discrimination index
be independent of the observed difficulty level for the item. (As
previously indicated, the critical values for B do not depend
upon observed difficulty level.)

⁴ On some norm-referenced tests, such as psychological inventories,
negatively discriminating items may be acceptable, depending upon the
desired results for the item.

Popham and Husek (1969) indicate that negatively discrim- 
inating items are certainly unacceptable for criterion-referenced 
tests. However, they neglect to mention that either the item or 
the instructional material or both may need revision. In general, 
when item analysis data (for criterion-referenced tests given 
only after instruction) indicate a need for revision, it is usually 
very difficult, if not impossible, to specify whether the item or 
the instruction teaching the objective(s) of the item (or both) 
should be revised. 

It is hard to believe that a positively discriminating item is
“just as respectable in a criterion-referenced test as it is in a
norm-referenced test (Popham and Husek, 1969, p. 6).” When
an item is positively discriminating there is one identifiable 
group of students (lower group) performing less effectively than 
another identifiable group (upper group). If, however, the in- 
struction preceding the criterion-referenced test is indeed equally 
effective for all students, then we would expect any incorrect 
responses on an item to be distributed essentially randomly
among the two groups (upper and lower); in this case, the dis- 
crimination index for the item should be essentially zero, not 
highly positive. Thus, a positive discrimination index may indi- 
cate that the instruction needs to be revised in order to be more 
effective for the lower group. If, on the other hand, the instruc- 
tion is actually equally effective for both the upper and the 
lower groups, then a positive discrimination index for an item 
indicates that the item needs revision. Clearly, then, a positive 
discrimination index should not be considered “ideal,” since it 
indicates that either the instruction or the item needs revision.

In summary, the ideal item in the criterion-referenced test- 
ing situation is the item with a nonsignificant discrimination 
index and a high difficulty level; items that discriminate nega- 
tively are clearly unacceptable; and items that discriminate posi- 
tively usually indicate a need for revision. 
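
The summary can be read as a small decision rule. The sketch below is one hypothetical encoding of it and is not a procedure given in the article. The function name, the returned labels, and the 0.80 cut-off for a "high difficulty level" (used here in the article's sense of a high proportion of students answering correctly) are assumptions. The argument critical_b would be taken from the Appendix for the relevant n₁ and n₂.

```python
def classify_criterion_referenced_item(b, critical_b, proportion_correct,
                                       high_difficulty=0.80):
    """Label an item from its B value, its critical value, and its difficulty."""
    if b <= -critical_b:
        return "unacceptable: discriminates negatively"
    if b >= critical_b:
        return "revise item or instruction: discriminates positively"
    if proportion_correct >= high_difficulty:  # hypothetical threshold
        return "ideal: nondiscriminating, most students correct"
    return "revise: nondiscriminating, few students correct"


# Hypothetical item with n1 = n2 = 10, for which the Appendix lists .500 at the .05 level.
print(classify_criterion_referenced_item(0.10, 0.500, 0.90))
```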
APPENDIX
Two-tailed Critical Values of B
n₁ n₂ 0.05 0.01   n₁ n₂ 0.05 0.01   n₁ n₂ 0.05 0.01
10 .500 .700 10 11 .436 .545 10 12 .450 .567 
13 .423 .546 10 14 .429 .52 10 15 .433 .567 
16 .400 .525 10 17 .400 .524 10 18 1.389 .522 
19 .389 .439 10 20 .450 .550 10 21 .376 .476 
22  .391 .491 10 23 .370 .496 10 24 .375 .492 
25  .380 .500 10 26 .369 .485 0 27 .363 .478 
28  .371 0.479 10 29 2.359 .459 10 30 0.400 .500 
11 .545 .636 11 12 .402 .561 11 13 .420 .524 
14 0.403 .532 11 15 .400 .509 11 16 0.386 .511 
17 .880 .497 11-18 .379 .490 11 19 .383 .493 
20 2377 .486 11 21 .355 .489 11 22 .409 .500 
23 0348 .474 11 24 .360 .455 11 25 .360 .465 
26 2357 .62 11 27 03600 .41 11 28 .354 .455 
29 351 .455 11 30 .355 4.455 12 12 .500 .583 
13 378 0526 12 14 .393 .500 12 15 .400 .500 
16 306 0500 12 17 .382 .490 12 18 .389 .500 
19 34 .474 12 20 .367 .483 12 21 .369 .476 
22 36 0.470 12 23 .362 .449 12 24 .375 .500 
25 353 437 12 26 0.340 .449 12 27 .352 .444 
28 35 .452 12 29 .342 .445 12 30 .350 .450 
13 1462 538 13 14 .401 489 13 15 .379 .508 
16 375 476 13 17 362 .480 13 18 .368 .462 
19 356 457 13 20 .346 .450 13 21 .348 .462 
22 36 455 13 23 .348 .445 13 2 0.340 „442 
25 335 449 13 26 .385 .462 18 27 .328 .439 
28 341 1423 13 2 324 .427 13 30 .328 .426 
14 1499 51 14 15 3 .462 14 16 .357 .478 
17 353 Ап 14 18 357 .460 14 19 „350 .455 
20 30 40 14 21 (357 .452 14 22 .338 1.448 
23 339 441 м 24 1333 .429 14 25 2329 .42 
90 ‘345 434 14 27 32 421 14 28 .357 464 
20 308 40 14 30 2319 .419 15 15 .400 „538 
16 ‘358 483 15 М 353 .455 15 18 .356 .467 
19 ‘340 442 15 20 .350 .450 15 21 .343 .438 
22 "Зот 464 15 23 35 .482 15 24 1325 .433 
25 1333 47 15 26 2323 .418 15 27 .326 .415 
28 314 412 15 29 .322 .398 15 30 4333 .433 
16 1438 50 16 17 .838 .456 16 18 .354 .431 
19 ‘332 1438 16 20 .337 .438 16 21 .324 .420 
22 1330 420 16 23 .326 .427 16 24 .333 .438 
25 эз 412 16 26 31 .409 16 27 .308 .407 
28 313 ап 16 29 .310 .403 16 30 .300 :392 
17 1412 мп 17. 18 .320 .435 М 19 .337 443 
20 i321 4л 17 21 317 423 17 22 .318 .420 
28 зт .412 17 24 .316 .412 17 25 -308 .407 
26 305 .400 17 27 .301 .397 17 28 .303 .397 
99 3% 380 17 30 .300 .396 18 18 .389 .500 
19 307 .412 18 20 .322 .428 18 21 35 „48 
22 зз .409 18 23 312 .411 18 24 .319 AIT 
25 130 306 18 26 .303 .307 18 27 .315 .407 
28 1208 389 18 2 .299 .389 18 30 .300 „389 
Two-tailed Critical Values of B (Continued)

n₁ n₂ 0.05 0.01   n₁ n₂ 0.05 0.01   n₁ n₂ 0.05 0.01
19 19 .368 .474 19 20 .332 .395 19 21 .308 .409 




19 22 .316 .409 19 23 .304 .391 19 24 .300 .395 
19 25 .299 .392 19 26 .304 .395 19 27 .294 .384 
19 28 .293 .382 19 29 .290 .377 19 20 .288 .374 
20 20 0.350 .450 20 21 .317 .414 20 22 .300 .391 
20 23 .304 .400 20 24 .300 .392 20 25 .300 .390 
20 26 .296 .385 20 27 .293 .381 20 28 .293 .386 
20 29 .290 .374 20 30 .300 .383 21 21 .333 .429 
21 22 .305 .396 21 23 .286 .379 21 24 .298 .387 
21 25 .293 .381 21 26 .284 .379 21 27 .291 .376 
21 28 .208 .381 21 29 .282 .373 21 30 .286 .367 
22 22 .318 .409 22 23 0.292 .381 22 24 .303 .390 
22 25  .284 .360 22 26 .287 .378 22 27 .286 .369 
22 28 .282 .364 22 29 .281 .362 22 30 .276 .364 
23 23 0.348 .435 2 24 .281 .366 23 25 .292 .379 
23 26 .276 .363 23 27 ,283 .304 23 28 .276 .362 
23 29  .274 .301 23 30 .275 .355 24 24 .333 .417 
24 25 .270 33 24 26 .282 .365 24 27 .273 .370 
24 28 .280 .357 24 29 .274 .358 24 30 .275 .358 
25 25  .320 .400 25 26 .262 .371 25 27 .273 .353 
25 28 .277 1.357 25 29 .268 .348 25 30 .273 .353 
26 26 .308 .385 26 27 .282 .359 26 28 .264 .343 
26 29 .272 .350 2 30 .262 .346 27 27 (206 .370 
27 28 .212 31 27 2 .258 .34 27 30 .267 .34l 
28 28 .286 0.303 28 29 .264 336 28 30 .250 .345 
29 29  .276 .370 29 30 .255 .35 30 30 2300 .367 


Note. The critical values of B contained in this table were calculated under the assumption
that p₁ = p₂ = 0.50, for a two-tailed test of significance.
Note that the values of n₁ and n₂ are interchangeable.
CORRECTING CORRELATIONS FOR RESTRICTIONS
IN RANGE DUE TO SELECTION ON AN 
UNMEASURED VARIABLE 


N. DALE BRYANT 
Teachers College, Columbia University 


SUNANDA GOKHALE 
Albany, New York 


The size of a correlation coefficient is dependent in part upon
the variability of the measured values in the correlation sample. 
Any time that a sample is restricted in range on either or both 
of the measures, the correlations between those two measures 
will tend to be lowered as compared to the same correlation 
based upon a representative sample of the population. If prediction
within the restricted sample is the purpose of the correlation,
then the obtained value is the meaningful and correct
one. However, if, for some reason, it is not possible to correlate 
the variables using an unrestricted sample, we can infer the 
relationship between the two measures irrespective of the restric- 
tion if we correct the correlation for the effect of the restriction 
in range. For example, if, in a sample of bright students, reading 
achievement and academic grades show only a .2 correlation, 
we cannot infer that this is the general relationship between 
reading and school grades. Since a high IQ group will tend to 
make high grades and will also tend to be high on reading abil- 
ity, there is likely to be severe restriction in range on both 
variables. For prediction within the high IQ group, the .2 correlation
is appropriate, but to infer beyond the sample, a correction
for restrictions in range is necessary. Guilford (1965,
pp. 341-345) gives three formulae, attributed to Karl Pearson, to
correct a Pearson product-moment correlation coefficient for
restriction in range when restriction results from selection on one
of the two variables being correlated or on some measured
variable. The assumption must be made that the variables are
normally distributed in the population.


Problem 


In many clinical and other settings, the sample is obviously
restricted in range on different variables, but the basis for the
restrictions (i.e., the selection variables) is complex, unknown, or
unmeasurable. Examples of such sampling might be children
coming to a particular clinic, cases receiving a particular diagnosis,
or individuals exhibiting a particular behavior. In all
these cases, the samples may show restrictions in range on variables
being correlated, but the basis of the restrictions cannot be
reduced to a measurable variable. In these instances, the formulae
presented by Guilford cannot be used. It is possible, however,
to correct for restrictions in range, even though the selection
variable is unknown or unmeasured, by using information
about the extent of the restriction on each of the two variables
being correlated.

This paper presents a formula whereby a Pearson product-moment
correlation can be corrected for restrictions in range
for these special but very frequent situations where the basis of
selection is unmeasured but where the extent of restriction for
each of the two measures being correlated is known and where
the variables are assumed to be normally distributed in the
population.


Formula for Use When Restrictions Result 
from Complex or Unmeasured Variables 


Starting with Guilford’s formula for correcting r₁₂ for
restriction in range, we can rewrite his Formula II so that it
corrects a correlation r₃₁, where restriction is produced by selection
on the basis of variable 3 and there is knowledge of the standard
deviations for variable 1 in both the restricted and unrestricted
samples. Similarly, we can rewrite his Formula I so
that it corrects a correlation r₃₁ where restriction is produced
by selection on the basis of variable 3 and there is knowledge
of the standard deviations for variable 3 in both the restricted
and unrestricted groups. By equating these two formulae and
squaring and simplifying them, we can obtain an equivalent value
for the ratio of unrestricted to restricted variances on variable
3, expressed in terms of the ratio of unrestricted to restricted
variances on variable 1 and the correlation r₃₁. The same procedure
can be followed by rewriting Formulae I and II to correct
r₃₂ so as to obtain an equivalent value for the ratio of
unrestricted to restricted variances on variable 3, expressed in
terms of the ratio of unrestricted to restricted variances on
variable 2 and the correlation r₃₂. Thus, the information about
restriction on variable 3 is expressed in terms of information about
the variables 1 and 2 and the correlations r₃₁ and r₃₂.

These equivalent ratio values described above can be substituted
into Guilford’s Formula III (for R₁₂), where restriction is produced
by selection on the basis of variable 3 and there is knowledge of the
standard deviations for variable 3 in both the restricted and
unrestricted groups and where r₃₁ and r₃₂ are known. However, since
there are two estimates of the ratio of unrestricted to restricted
variances on variable 3, we must express the value as the square root
of the product of the two estimates (viz., a = √(a₁ × a₂)).

The resulting formula for the corrected correlation (R₁₂) is given
below:

\begin{equation}
R_{12} = r_{12} \sqrt{\frac{s_1^2}{\sigma_1^2} \times \frac{s_2^2}{\sigma_2^2}}
       + \sqrt{\left(1 - \frac{s_1^2}{\sigma_1^2}\right) \left(1 - \frac{s_2^2}{\sigma_2^2}\right)}
\end{equation}

This formula does not require all of the information necessary for
Guilford’s Formulae I, II, and III, but it can be used to obtain
a product-moment correlation coefficient that is corrected for
restrictions in range (R₁₂) knowing only the uncorrected correlation
(r₁₂), the standard deviations of the two variables in
the restricted samples (s₁ and s₂), and the standard deviations
of the two variables in the unrestricted sample (σ₁ and σ₂).¹


¹ In kindly checking this derivation, Dr. Rosedith Sitgreaves, Principal
Advisor, [illegible] and Statistical Methods Area, Psychology Department,
Teachers College, Columbia University, pointed out that the formula
could be obtained somewhat more directly without recourse to the Guilford
formulae. The senior author will be happy to send upon request both the
original and Dr. Sitgreaves’ derivations to anyone requesting them.

Examples of Use of the Formula

In a clinical sample of children, it was noted that a particular
measure (the Coding subtest on the Wechsler Intelligence Scale
for Children) was consistently lower than the average of the 
other intelligence subtests. The sample consisted of children of 
average or above average IQ who were brought by their par- 
ents to a clinic because school remedial procedures were not 
correcting the children’s severe reading retardation. To study the 
nature of this lowered performance, the variable, Coding, was 
correlated with other reference variables such as the Perceptual 
Speed Test of the Primary Mental Abilities Test Battery. The 
correlation of a reference variable and the Coding subtest needs 
to be compared to equivalent values in a sample representative 
of the population as given in other research studies. In order to 
make the correlation based upon the clinical sample comparable 
to the correlation based upon the sample representative of the 
population, it is necessary to correct for restrictions in range, 
since both Coding and the reference variable, Perceptual Speed, 
show consistently lower scores than are normally found in a 
presumably representative sample from the population. The spe- 
cific factors responsible for the restriction in range cannot be 
measured, since coming to a clinic involves much more than poor 
reading. In both Coding and Perceptual Speed, we can assume 
normality of distribution within the population. 

The values obtained for the clinic sample are as follows: r₁₂ =
.40, where 1 and 2 represent Coding and Perceptual Speed respectively;
s₁² = 2.59 and s₂² = 186, where s² is the variance for the
clinic sample. Equivalent values for normative samples of appropriate
age as given in the manuals for the respective tests are σ₁² = 9
and σ₂² = 289, where σ² is the variance based upon the normative
samples. Substituting in the final formula given above:

\begin{equation}
R_{12} = .40 \sqrt{\frac{2.59}{9} \times \frac{186}{289}}
       + \sqrt{\left(1 - \frac{2.59}{9}\right) \left(1 - \frac{186}{289}\right)} = .68
\end{equation}
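
The substitution above can be reproduced with a short sketch. It reads the worked example as R₁₂ = r₁₂ √[(s₁²/σ₁²)(s₂²/σ₂²)] + √[(1 − s₁²/σ₁²)(1 − s₂²/σ₂²)]. The function name and the input check are hypothetical additions.

```python
from math import sqrt


def correct_for_range_restriction(r12, s1_sq, s2_sq, sigma1_sq, sigma2_sq):
    """Corrected correlation from restricted (s^2) and unrestricted (sigma^2) variances."""
    ratio1 = s1_sq / sigma1_sq
    ratio2 = s2_sq / sigma2_sq
    if not (0 < ratio1 <= 1 and 0 < ratio2 <= 1):
        raise ValueError("restricted variances must be positive and no larger "
                         "than the unrestricted variances")
    return r12 * sqrt(ratio1 * ratio2) + sqrt((1 - ratio1) * (1 - ratio2))


# Clinic-sample values from the example: r12 = .40, s1^2 = 2.59, s2^2 = 186,
# with normative variances 9 and 289.
print(round(correct_for_range_restriction(0.40, 2.59, 186, 9, 289), 2))  # 0.68
```

When neither variable is restricted (both ratios equal 1) the second term vanishes and the function returns r₁₂ unchanged.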

A study, based upon a “normal” sample of eighth grade children
(which is roughly comparable to the grade placement of
the clinic sample) and having variances similar to the population
values, reported that r₁₂ = .37.

By using the correction for restrictions in range, it is possible to
compare the .68 in the clinic sample with the .37 in the normal
sample. It suggests that there is a higher degree of relationship
between these two measures in the clinic sample (and confirms
certain conclusions drawn from clinical observation). While it is 
beyond the scope of this paper to comment upon the interpretation 
of this finding, it is apparent that interpretations could be made 
that could not have been made if there had been no correction for 
restrictions in range. 

The illustration above is of a case where a comparison is to be 
made between a value derived from an unrestricted sample and one 
derived from a restricted sample. The values have to be expressed 
in comparable terms, so the correction for restrictions is necessary. 

Another example of a case where the correction for restrictions 
in range is necessary is when a correlation is obtained on a special, 
restricted sample and must be generalized to the population. An 
example of this might be a study of the relationship between the 
amount of a particular chemical in the blood and the frequency 
of hallucinatory-type activity. Since this is hard to study in a 
nonclinical population, we might study it in a sample of in- 
dividuals diagnosed as schizophrenic. If schizophrenics seldom 
have a low concentration of the chemical in their blood and if 
they tend to show more frequent hallucinatory-type activity 
than would be true for the total population, then both of these 
variables are restricted in range. A correlation between the two 
variables in the schizophrenic sample can be used to infer what 
the relationship would be in the total population if it is assumed 
that the same relationship holds true for lower levels of the 
chemical and less frequent hallucinatory-type activity and that 
the clinical sample merely represents one end of a distribution 
on these two variables, which are normally distributed in the 
population. While these assumptions might not be justified, it 
is evident that, if they are made, the correlation based upon 
the schizophrenic sample would have to be corrected for re- 
strictions in range in order to infer the relationship in the popu- 
lation. The basis of the selection of the sample is complex, and, 
unless a measure of the selection variable can be obtained, it 
would be necessary to use a formula such as the one presented 
in this paper. 

Another example of the application of the formula would be its
use in estimating the validity of a test where the criterion and
test scores are available on the same individuals only in a restricted
sample where the basis of the selection is not clear or
not measured. If the variance of the test is known for some sample
that is representative of the population and the variance of
the criterion is known for some other sample representative of
the population, the formula can provide a correction to estimate
the validity of the test in an unrestricted sample.


Summary 

There are many times that Pearson product-moment correlations
are based on clinical samples or other special groups where
there are restrictions in range on the variables being correlated
and where the basis of the selection that causes the restrictions
is unknown or unmeasured. It is often necessary either to compare
the correlation with values derived from a sample representative
of the population or to infer from the special sample
the nature of the relationship that exists between the two variables
within the total population. In such cases, if the assumption
can be made that the variables are normally distributed in
the population, the formula presented in this paper is applicable
in correcting the correlation coefficient for restrictions in range.
A COMPARISON OF FIVE VARIABLE WEIGHTING
PROCEDURES¹


JOHN G. CLAUDY²

MULTIPLE regression is one of the statistical techniques most
widely used by psychologists, educators, and social scientists in
general. This utilization is becoming increasingly widespread as
electronic computers become more universally available. In general,
multiple regression procedures are applied to a sample of
data and the sample multiple correlation coefficient is taken as 
an estimate of the multiple correlation coefficient in the popu- 
lation from which the sample was drawn, and the sample beta 
weights are taken as estimates of the population beta weights. 
It is these population parameters which are usually of interest, 
and not the sample statistics in and of themselves. 

In spite of the fact that this technique is so widely used, the
situation seems little improved from what it was 20 years ago
when Cureton (1950) wrote: “It is doubtful that any other
statistical techniques have been so generally and widely misused
and misinterpreted in educational research as have those
of multiple correlation (p. 690).” Nor is there any reason to expect
improvement in this situation, and indeed, the ready availability
of standard computer regression programs may be making
it worse. All too often the nature of the data used, or the size
of the sample employed, is not satisfactory for multiple regression
purposes.

1 The author is indebted to the University of Tennessee Computer Center
which provided the use of its facilities. This center is in part supported by
Hus No. NAS8-11189 from the National Aeronautics and Space Administration.

2 Now at American Institutes for Research, Palo Alto, California. The
author wishes to thank Edward E. Cureton for his help and suggestions,
The multiple regression technique assures a solution for the
set of regression weights with two primary properties: (a) the 
sum of squares of differences between the actual and predicted 
dependent variable values will be a minimum, and (b) the cor- 
relation between the actual and predicted dependent variable 
values will be a maximum, where both of these properties apply 
to the sample from which the weights were derived. 

A great deal of confusion exists concerning the application of 
multiple regression procedures. This confusion stems in part from 
the fact that the model and theory underlying multiple regres-
sion are not appropriate for the majority of situations encountered
in behavioral research. Multiple regression is a technique which 
is best adapted to experimental rather than survey data, and in- 
volves assumptions which are not tenable in nonexperimental 
situations, especially those which involve the use of psychological 
tests. One of the basic assumptions of multiple regression is 
that the values or levels of the independent variables are de- 
cided upon and fixed by the experimenter prior to conducting 
the experiment. Thus they are under the control of and set by the 
experimenter and are not subject to error. Only the dependent 
variable is free to vary from subject to subject, and, if the usual 
sampling error formulas are to apply, its distribution must be 
normal. Thus the dependent variable is the only variable sub- 
ject to error. This is termed the “regression” or Fixed-X model 
since the values of the independent or X variables are assumed 
to be fixed by the experimenter. 

This is generally not the situation we find in psychology, at 
least not in most cases where multiple regression is applied. The 
values of the independent variables are often the scores which 
an individual makes on a group of psychological tests of mod- 
erate reliability. Not only are the values of the independent 
variables not fixed by the researcher, but they are also subject 
to error—often to large error. 

The “regression” or Fixed-X model does not apply to these 
conditions, and in its place has been proposed the “correlation” 
or Random-X model. This second model allows the independent 
variables to vary freely. However, the Random-X model is so 
complex that usable computational procedures have not been developed
for many of its aspects (Burket, 1964; Nicholson, 1948).
It actually provides little more than a way to talk about the
situation and data; for even when the data are from a Random-X
situation, Fixed-X regression procedures are ordinarily applied.

Graybill (1961) and Nicholson (1948) indicate that if the
assumptions of the Fixed-X regression model are met, sample
beta weights are unbiased estimates of the population beta
weights; and, when Fixed-X regression model procedures are applied
to Random-X data, the sample beta weights are maximum
likelihood estimates of the population beta weights. However,
it is not clear that sample beta weights for Random-X data,
obtained using Fixed-X procedures, are unbiased estimates of
the population beta weights; or that the sample beta weights,
derived from a sample of finite size, and which maximize the
multiple correlation in the sample, are the “best” estimates of
the population beta weights which can be obtained from that
sample, “best” in the sense that their use in the population will
minimize the difference between the resulting aggregate correlation
in the population and the population multiple correlation.

This misapplication of the Fixed-X model to Random-X data
causes an over-fitting of the regression surface to the sample data.
The regression surface is fitted to the errors as well as to the
systematic trends. This over-fitting, or error-fitting, results in
two types of errors: (a) the population multiple correlation is
over-estimated by the sample multiple correlation coefficient, and
(b) the interpredictor variability of the n sample beta weights is
inflated, relative to that of the population beta weights. The
variability referred to here is not the variance or standard error
of a single beta weight for a single predictor variable, but rather
the standard deviation of a set of n beta weights derived from
a sample of n predictor variables. For example: given a sample
with five predictor variables and one criterion variable, five beta
weights are calculated. It is the variability of these five beta
weights taken as a group which is of concern here. The reason
for this inflation of their variability is obvious: not only are the
sample beta weights subject to the true variability of the population
weights; they are also subject to the variability due to
sampling and measurement errors.

The first type of error caused by the misapplication of the
Fixed-X regression model to Random-X data, that is, the overestimation
of the population multiple correlation by the sample
multiple correlation, has generated considerable research. For instance,
shrinkage formulas and cross-validation are suggested
techniques for overcoming the problem. (See Claudy, 1969, for a
review of this literature.) However, little attention has been paid
to the inflation of the variability of the n sample beta weights
as compared with the actual variability of the n population beta
weights, and in fact this is a seldom mentioned finding. Preliminary
empirical studies indicated that the variability of sample
beta weights is inflated at all sample sizes, but that the degree
of inflation decreases with increasing sample size. Cureton
(1962) proposed a method to determine sample regression weights
whose variability will be more nearly equal to that of the population
beta weights. This method is termed the “least deviant”
procedure:


1. Divide the original sample into two equal subsamples.
2. Determine the beta weights on each subsample.
3. Arrange the beta weights from both subsamples in a single
   rank order from highest positive to lowest positive or highest
   negative and determine the median.
4. From the pair of beta weights for each variable, select the
   one nearest the median and use this as the weight for that
   variable in the regression equation.
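
The four steps above can be sketched as follows. This is an illustration only, not the author's FORTRAN program: the function names are hypothetical, the split is assumed to be random, and the median of the 2n pooled weights is taken as the mean of the two middle values.

```python
import numpy as np

def beta_weights(X, y):
    """Least-squares weights for standardized predictors and criterion."""
    Zx = (X - X.mean(axis=0)) / X.std(axis=0)
    zy = (y - y.mean()) / y.std()
    coef, *_ = np.linalg.lstsq(Zx, zy, rcond=None)
    return coef

def least_deviant(b1, b2):
    """Steps 3 and 4: per variable, keep the weight nearest the pooled median."""
    pair = np.vstack([b1, b2])
    median = np.median(pair)                      # median of all 2n weights
    nearest = np.abs(pair - median).argmin(axis=0)
    return pair[nearest, np.arange(pair.shape[1])]

def least_deviant_weights(X, y, rng):
    """Steps 1 to 4 applied to one sample (an odd N drops one observation)."""
    order = rng.permutation(len(y))
    half = len(y) // 2
    first, second = order[:half], order[half:2 * half]
    return least_deviant(beta_weights(X[first], y[first]),
                         beta_weights(X[second], y[second]))

# Small fixed example of steps 3 and 4 with three predictors.
demo = least_deviant(np.array([0.6, 0.1, 0.3]), np.array([0.2, 0.4, 0.45]))
print(demo)
```

In the fixed example the pooled median is .35, so the weights .2, .4, and .3 are retained.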


A second method for reducing the variability of sample beta
weights, termed here the “average” procedure, is suggested. However,
it cannot be expected to reduce the variability as much as
will the “least deviant” procedure:


1. Divide the original sample into two equal subsamples. 

2. Determine the beta weights on each subsample. 

3. Determine the mean of the pair of beta weights for each 
variable and use this mean as the weight for that variable 
in the regression equation. 
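
A comparable sketch of the “average” procedure is given below. As before, the names are hypothetical and a random split into halves is assumed.

```python
import numpy as np

def beta_weights(X, y):
    """Least-squares weights for standardized predictors and criterion."""
    Zx = (X - X.mean(axis=0)) / X.std(axis=0)
    zy = (y - y.mean()) / y.std()
    coef, *_ = np.linalg.lstsq(Zx, zy, rcond=None)
    return coef

def average_weights(X, y, rng):
    """Mean of the two subsample beta weights for each variable."""
    order = rng.permutation(len(y))
    half = len(y) // 2
    first, second = order[:half], order[half:2 * half]
    b1 = beta_weights(X[first], y[first])
    b2 = beta_weights(X[second], y[second])
    return (b1 + b2) / 2.0

# One predictor with an exact linear relation: each subsample beta is 1.
X_demo = np.arange(20, dtype=float).reshape(-1, 1)
y_demo = 3.0 * X_demo[:, 0] + 2.0
demo = average_weights(X_demo, y_demo, np.random.default_rng(0))
print(demo)
```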


Though their relation to the double cross-validation procedure
(Mosier, 1951) is obvious, neither of these variance-reduction
procedures has any real theoretical or mathematical basis, and
prior to this time neither has been empirically studied. Thus the
first purpose of this study was to investigate empirically the accuracy
of prediction using regression weights selected through
the use of these variance-reduction procedures.

Previous research (Boyce, 1955; Douglass, 1958; Lawshe and 
Schucker, 1959; Marks, 1966; Perloff, 1951; and Wesman and 
Bennett, 1959) has indicated that for small samples, equal raw 
score weights and criterion correlation weights usually yield 
higher cross-validities when applied to a new sample than do 
sample beta weights from the first sample. However, these studies 
suffered from small population sizes, small numbers of populations,
and very restricted ranges for the population parameters
(variable means, standard deviations, and intercorrelations). 
Therefore, the second major purpose of this study was to in- 
vestigate empirically, under less restrictive conditions, the rela- 
tive predictive accuracy of equal raw score weights, sample cri- 
terion correlation weights, and sample beta weights. These three 
weighting procedures were also compared with the two variance- 
reduction procedures. (Criterion correlation weights are the first- 
order correlation coefficients between each of the m predictor 
variables and the criterion variable.) 


Method 


Only small sample sizes were used. The reasons for this limi- 
tation were twofold: (a) many current studies, especially ap- 
plied studies, making use of multiple regression procedures have 
used only small sample sizes; and (b) the errors in the results 
of multiple regression procedures are greatest with small sample 
sizes and thus there is more room here for improvement. Rather
than use real data, the decision was made to use computer gen- 
erated populations of data, and to draw samples from these pop- 
ulations. Thus the characteristics of the populations could be 
widely varied and the generality of the findings estimated. 

A FORTRAN IV program for the IBM 360/40 computer was 
written to generate the data. Input to this program consisted of: 
the size of the population; the total number of variables (inde- 
pendent plus dependent); the number of common factors for 
these variables; the desired simple structure factor matrix, in- 
cluding unique loadings for the variables; and a randomly chosen 
six digit number used to initialize a pseudo random normal
number generator. The inputted simple structure factor matrix
allowed both the factor structure of each variable and the in- 
tercorrelation matrix among the variables to be determined prior 
to the generation of the data. In this way the generated popula- 
tions could be made to correspond to the sorts of populations 
found in real data, as reported in the literature. Factor analyses 
of variables generated in this manner indicated that their factor 
patterns correspond very closely to those of the inputted factor 
pattern matrices, and regression analyses indicate that the vari- 
ables are linearly related. After generation, the entire popula- 
tion was stored on a random access disc file of the computer. 
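
One way such a population could be generated is sketched below. It is not the original FORTRAN IV program: it assumes orthogonal common factors, unit-variance variables, and unique loadings derived from the communalities rather than supplied as input, and the loadings shown are invented for illustration.

```python
import numpy as np

def generate_population(loadings, n_obs, seed):
    """Generate n_obs observations whose correlations follow a factor matrix.

    loadings: (n_variables, n_factors) common-factor loadings.
    Unique loadings are set so that every variable has unit variance.
    """
    rng = np.random.default_rng(seed)
    loadings = np.asarray(loadings, dtype=float)
    communality = (loadings ** 2).sum(axis=1)
    unique = np.sqrt(1.0 - communality)
    common_scores = rng.standard_normal((n_obs, loadings.shape[1]))
    unique_scores = rng.standard_normal((n_obs, loadings.shape[0]))
    return common_scores @ loadings.T + unique_scores * unique

# Hypothetical factor matrix: three predictors and one criterion, two factors.
loadings = np.array([[0.7, 0.0],
                     [0.6, 0.3],
                     [0.0, 0.6],
                     [0.5, 0.5]])
population = generate_population(loadings, 500, seed=123456)
print(np.corrcoef(population, rowvar=False).round(2))
```

Under these assumptions the intended correlation between two variables is the inner product of their loading rows, so the matrix can be fixed before any data are generated.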

Eighteen independent populations of 500 sets of observations 
each were generated. The selection of a population size of 500 
was a compromise between a desire to have the populations as 
large as possible, a desire to use as many populations as possible, 
and the space and time limitations imposed by the computer. 
The parameters of the populations were chosen to represent the 
statistics of samples of real data, as they have been reported in 
the psychological and educational literature, for a wide range 
of variables. Thus the populations differed in their intercorrela- 
tions, factor structures, and numbers of variables. However, in 
no case did the number of independent or predictor variables 
exceed five. This limitation was imposed because prior studies 
have indicated that prediction is seldom improved by the ad- 
dition of independent variables in excess of this number. 

From each population, a total of 400 independent samples 
were drawn, consisting of 100 of each of the following sizes: 20, 
40, 80, and 160. The sets of observations to be included in each 
sample were selected by the use of a random number generator 
which generated a number between one and 500 inclusive. A 
sample of size N consisted of the first N random numbers gen- 
erated, a different set of N numbers being generated for each 
sample. Due to the limited population size, these samples were 
drawn with replacement. Using this procedure the finite population
of 500 observations is equivalent to an infinite population
in which the same series of 500 observations is repeated an infinite
number of times. Thus, the parameters of the population
of 500 are the same as those of the infinite population it represents,
and the sampling is approximately equivalent to what
it would have been had the samples been drawn without replacement
from an infinite population having the same parameters
but in which all combinations of predictor and criterion scores
could appear. It should be noted, however, that this implies that each
observed score is the true score for that individual on that variable.
In other words, this study concerns the effects of different
sample sizes only, and does not consider the effects of unreliability.
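
The sampling plan for a single population can be sketched as follows. The seed and names are arbitrary, and row indices run from 0 to 499 rather than 1 to 500.

```python
import numpy as np

POPULATION_SIZE = 500
SAMPLE_SIZES = (20, 40, 80, 160)
SAMPLES_PER_SIZE = 100

rng = np.random.default_rng(123456)

# For each sample size, 100 index vectors drawn with replacement.
sample_indices = {
    n: [rng.integers(0, POPULATION_SIZE, size=n) for _ in range(SAMPLES_PER_SIZE)]
    for n in SAMPLE_SIZES
}

total_samples = sum(len(v) for v in sample_indices.values())
print(total_samples)
```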

Each sample was treated in the following manner. Four sets of 
regression weights were derived: the two types of split sample 
variance-reduction weights (“least deviant” and “average”), beta 
weights derived from the complete sample, and criterion correla- 
tions derived from the complete sample. These four sets of 
weights, as well as a set of equal raw score weights (unities) 
were then applied to every member of the population to ob- 
tain five population validities. The two types of variance-reduc- 
tion weights and the sample beta weights were applied to the 
predictor variable z scores. This was done since the variance- 
reduction weights were considered estimates of the population 
beta weights, and beta weights are correctly applied only to 
standard scores to yield an estimate of the criterion variable 
standard score. The criterion correlation weights and the equal 
raw score weights were not considered estimates of the population
beta weights and thus they were applied directly to the
predictor variable raw score values. Applying these weights to
raw score values is not equivalent to applying them to z scores.
However, the decision to include criterion correlation weights
and unit weights was motivated by the desire to determine if
“simpler” weighting procedures are as effective as the “more
complex” procedures, and applying weights to raw score values
is “simpler” than applying weights to z scores. By applying the
five sets of weights to every member of the population, it was possible
to determine which weighting method yielded the best prediction
of the dependent variable values in the total population
and could thus be considered the “best” weighting method for
that population and sample. Four sample sizes were used to determine
whether different weighting methods are more effective
for different sample sizes, and the use of 100 samples of each
sample size sought to insure the accuracy of results. This procedure
was repeated for each of the eighteen independent populations.
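
The treatment of one sample can be sketched as follows. This is an illustration under stated assumptions, not the original program: the population is standardized with its own means and standard deviations, the split for the two variance-reduction procedures is random, and the population and all names are invented.

```python
import numpy as np

def zscore(a):
    """Standardize columns (or a vector) to mean 0 and standard deviation 1."""
    return (a - a.mean(axis=0)) / a.std(axis=0)

def beta_weights(X, y):
    """Least-squares weights for standardized predictors and criterion."""
    coef, *_ = np.linalg.lstsq(zscore(X), zscore(y), rcond=None)
    return coef

def five_weight_sets(X, y, rng):
    """Derive the five sets of weights from one sample."""
    order = rng.permutation(len(y))
    half = len(y) // 2
    first, second = order[:half], order[half:2 * half]
    pair = np.vstack([beta_weights(X[first], y[first]),
                      beta_weights(X[second], y[second])])
    nearest = np.abs(pair - np.median(pair)).argmin(axis=0)
    n_pred = X.shape[1]
    return {
        "L": pair[nearest, np.arange(n_pred)],   # least deviant
        "A": pair.mean(axis=0),                  # average
        "B": beta_weights(X, y),                 # complete-sample betas
        "C": np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_pred)]),
        "E": np.ones(n_pred),                    # equal raw score weights
    }

def population_validities(pop_X, pop_y, sample_idx, rng):
    """Correlate each weighted composite with the criterion in the population."""
    weights = five_weight_sets(pop_X[sample_idx], pop_y[sample_idx], rng)
    pop_z = zscore(pop_X)
    validities = {}
    for name, w in weights.items():
        # L, A, and B are applied to z scores; C and E to raw scores.
        scores = pop_z if name in ("L", "A", "B") else pop_X
        validities[name] = np.corrcoef(scores @ w, pop_y)[0, 1]
    return validities

# Invented population of 500 with three predictors, and one sample of 40.
rng = np.random.default_rng(0)
pop_X = rng.normal(50, 10, size=(500, 3))
pop_y = pop_X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 8, size=500)
sample_idx = rng.integers(0, 500, size=40)
print(population_validities(pop_X, pop_y, sample_idx, rng))
```

Repeating the last two lines for 100 samples at each sample size and averaging the results would give entries of the kind reported in Table 2.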


Results 


Table 1 presents for each of the 18 populations the weighting
procedure which resulted in the highest average population
validity. If one of the other weighting procedures resulted in a
population validity which was no more than .01 less than that of
the procedure resulting in the highest population validity, that
procedure is given in parentheses. The relations among sample
size, weighting procedure, and validity for a typical population
are presented in Table 2.

The two variance-reduction procedures, “least deviant” and
“average,” were inferior to the other procedures at the smaller
sample sizes studied and about equal in predictive effectiveness
at the larger sample sizes. However, in no case were they superior


TABLE 1 
Weighting Procedure Resulting in Highest Population Validity 


Population Sample size 
Number 20 80 160 

1 E C (A, ВЕ 
2 E nO) ya 
3 E B (A,C,E,L) 
4 E C (E) 

5 E E (В,С) 
6 E С (E) 

т Е Е 

8 Е Е 

9 Е Е 
10 Е Е 
11 Е C (B) 
12 € (E) C (A,B) 
13 Е В (A,0,B) 
14 Е (А) 
15 Е В (АТ) 
оз 3 

В) 
18+ B,C EX 


A = Average variance-reduction weights.

B = Beta weights.

C = Criterion correlation weights.

E = Equal raw score weights. 

L = Least deviant variance-reduction weights. 


Note.—Weighting procedures resulting in population validities no more than .01 less than that
of the procedure resulting in the highest population validity are given in parentheses.
TABLE 2
Average Population Validities in Population Five


                 Method of regression weight estimation

                                                           Criterion
          Least Deviant      Average       Sample Betas    Correlation
Sample
Size      Mean  St Dev    Mean  St Dev    Mean  St Dev    Mean  St Dev

Population multiple correlation = .531
Equal raw score weights = .526

 20       .437   .174     .421   .136     .455   .079     .497   .053
 40       .492   .063     .495   .039     .498   .035     .518   .019
 80       .517   .018     .514   .020     .516   .017     .525   .007
160       .523   .007     .522   .009     .522   .008     .528   .004


Note.—Each entry is based on 100 samples.


to the other three weighting procedures, each of which required 
considerably less computation. The “least deviant" procedure did 
yield sample regression weights whose variabilities were more 
nearly equal to those of the population beta weights, but ap- 
parently the improvement in variability was offset by other fac- 
tors. The use of the "average" procedure did not result in any 
improvement in the variability of the sample regression weights. 


Discussion and Conclusions 


Since the use of the variance-reduction procedures did not re- 
sult in weights whose population validities were superior to those 
obtained using the more conventional and computationally sim- 
pler procedures, their use for the estimation of population least 
squares regression weights is not recommended.

An examination of the results presented in Table 1 suggested 
that it might be possible to group the 18 populations according 
to the predictive effectiveness of the regression weight estima- 
tion procedures. Accordingly, a rough “eyeball” analysis was 
carried out and from this analysis three classes of populations 
emerged. These classes were termed X, Y, and Z. The popula- 
tions which comprised each class tended to have certain features 
in common, especially with regard to the intercorrelation ma- 
trices and the correlations between the dependent and inde- 
pendent variables. Class X populations, numbers 1 through 10, 
tended to have criterion correlations which exhibited low vari- 
ability and predictor intercorrelations which were in the low to 
moderate range, .00 to .40. Class Y populations, numbers 11
through 14, were typified by criterion correlations of low variability,
as in Class X populations, but they tended to have predictor
intercorrelations which were either low negative, -.20 to
.00, or high positive, .40 and above, with relatively few in the
low positive range. Finally, Class Z populations, numbers 15
through 18, had criterion correlations which were highly variable
and predictor intercorrelations in the moderate negative
to moderate positive range, -.30 to .40. These class characteristics
were not perfectly represented in all populations of a
given class, and the numbers of populations in Class Y and Class
Z are small, but they are descriptive of the overall findings.
The number of independent variables was not related to pre- 
dictive effectiveness of the weighting procedures. Also, the rela- 
tive values of the mean predictor intercorrelation and the mean 
criterion correlation of a population appear to be independent 
of class membership. 

Based on findings discussed above it is possible to develop a
tentative set of general guidelines concerning what weighting
procedure can most effectively be used for certain sample sizes:


1. For samples in which the criterion correlations show little
   variability and the intercorrelations among the independent
   variables are of low to moderate size (Class X), the use of
   equal raw score weights is recommended.

2. For samples in which the criterion correlations show little
   variability and the intercorrelations among the independent
   variables are on the average either high positive or low
   negative (Class Y), use equal raw score weights for samples
   of less than 50 and beta weights for larger samples.

3. For samples in which the criterion correlations show wide
   variability (Class Z), the use of beta weights is recommended
   at all sample sizes.
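
The three guidelines can be written as a small decision rule. The function name and return labels are hypothetical, and assigning a sample to Class X, Y, or Z is left to the researcher's judgment, as in the text.

```python
def recommended_weights(population_class, sample_size):
    """Return the weighting procedure suggested by the tentative guidelines."""
    if population_class == "X":
        return "equal raw score weights"
    if population_class == "Y":
        # Equal weights for samples of less than 50, beta weights otherwise.
        return "equal raw score weights" if sample_size < 50 else "beta weights"
    if population_class == "Z":
        return "beta weights"
    raise ValueError("population_class must be 'X', 'Y', or 'Z'")

print(recommended_weights("Y", 40))
print(recommended_weights("Y", 80))
```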


Even if equal raw score weights are used it may still be nec- 
essary to determine a least squares regression equation to pre- 
dict the values of the criterion variable for individuals. Applying 
equal weights to the raw score values of the independent variables 
results in a composite which is nothing more than a sum of raw
scores. These sums for a group of individuals predict the relative
standings of these individuals on the dependent variable, but 
give no information concerning their actual criterion scores. In 
order to determine this information it is necessary to determine 
the regression of the composite sum on the dependent variable. 
Then, given a composite sum, the actual values of the criterion 
scores can be predicted. 
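
A minimal sketch of this last step is given below, assuming an ordinary least-squares line relating the criterion to the unit-weighted sum. The data and names are invented for illustration.

```python
import numpy as np

def fit_composite_regression(X, y):
    """Fit the criterion on the sum of raw predictor scores.

    Returns (slope, intercept) of the least-squares line.
    """
    composite = X.sum(axis=1)            # equal raw score weights (unities)
    slope, intercept = np.polyfit(composite, y, 1)
    return slope, intercept

# Invented scores for four individuals on two predictors.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])
y = 2.0 * X.sum(axis=1) + 1.0
slope, intercept = fit_composite_regression(X, y)

# Predicted criterion score for a new individual with composite sum 6.
predicted = intercept + slope * 6.0
print(predicted)
```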

The results of those aspects of this study which have been 
previously investigated are in agreement with the prior studies. 
Boyce (1955), Lawshe and Schucker (1959), and Wesman and 
Bennett (1959) studied the effectiveness of equal raw score 
weights compared with beta weights. In all three studies the use 
of equal weights was found to be as good as or better than the 
use of beta weights for small and moderately small samples. 
While only equal raw score weights were here investigated, it is 
noteworthy that the studies of Boyce (1955), Douglass (1958), 
Marks (1966), and Perloff (1951), which investigated equal stan- 
dard score weights, also found these weights as good as or better 
than beta weights when samples are not large. Both Marks (1966) 
and Perloff (1951) found that beta weights became more effective 
at sample sizes of 200 and above. This finding tends to be sup- 
ported by this study. 

The major difference between these earlier studies and the 
current one is the fact that this study determined the population 
validity of the regression equation rather than a sample cross- 
validity. The researcher is usually interested in the population 
validity and thus this study is more representative of this in- 
terest than those prior to it. As in the prior studies, the prob- 
lem of variable unreliability was not investigated. 

Consideration of these results leads to a recommendation 
which has important implications for practice. When only small 
sample sizes are available (N less than 200), the use of mul- 
tiple regression procedures is often of doubtful value. In many 
cases more easily derived weights result in population validi- 
ties which are superior to those of a regression equation arrived 
at by multiple regression. Thus, use of the set of general guide- 
lines is recommended. However, these guidelines must be regarded 
as tentative and subject to revision as more populations are 
studied. 
ESTIMATING TRUE SCORES USING GROUP
MEMBERSHIP¹


CHARLES E. WERTS AND ROBERT L. LINN
Educational Testing Service


THE classical approach to estimating true scores given group
membership information is to use the formula


\[
\hat{T}_{ij} = \bar{X}_j + R_{xx} (X_{ij} - \bar{X}_j) \tag{1}
\]


where $\hat{T}_{ij}$ is the estimated true score,

      $\bar{X}_j$ is the observed mean of group j,

      $R_{xx}$ is the test reliability, assumed homogeneous across
      groups,

      $X_{ij}$ is the observed score for person i in group j.

If two parallel tests were available the reliability could be computed
as the correlation between tests; however, two sets of individual
values and group means would be observed. The estimation
problem is to use all information to obtain a better
true score estimate. The general problem of using group status to
estimate true scores given multiple measures will be considered
in this paper.
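
Equation (1) can be illustrated with a short sketch. The function name and the numerical values are hypothetical.

```python
def estimate_true_score(observed, group_mean, reliability):
    """Equation (1): regress the observed score toward its group mean."""
    return group_mean + reliability * (observed - group_mean)

# A person scoring 120 in a group with mean 100, test reliability .80.
estimate = estimate_true_score(120.0, 100.0, 0.80)
print(estimate)
```

The lower the reliability, the closer the estimate lies to the group mean; with a reliability of 1 the observed score is returned unchanged.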

The linear regression of T on X which is implicitly assumed
for equation (1) will hold only under very restricted conditions
including normally and independently distributed error and true
scores (Lord and Novick, 1968, theorem 22.8.1). Thus, equation
(1) is generally only an approximate procedure and its usefulness
depends on the assumption that the regression of T on X is not
too nonlinear. This limitation also applies to the procedure that
is presented below.


1 The research reported herein was performed pursuant to Grant No.
OEG-2-700033(509) with the United States Department of Health, Education, and
Welfare and the Office of Education.


Congeneric Measures


Jöreskog (1969a) has shown that equivalent (parallel), essentially
tau-equivalent, and tau-equivalent measures (as defined
by Lord and Novick, 1968, pg. 50) are special cases of the congeneric
test model. The model for congeneric measures can be
written


\[
X_{ki} = B_k T_i + I_k + e_{ki} \tag{2}
\]


where $X_{ki}$ is the observed value on test k for person i,
      $T_i$ is the true score,
      $B_k$ is the slope of the $X_{ki}$ on $T_i$ regression line,
      $I_k$ is the intercept of the $X_{ki}$ on $T_i$ regression line, and
      $e_{ki}$ are the errors of measurement for test k, assumed
      to have a zero mean for all levels of $T_i$.


The errors of measurement in one test are assumed to be independent
of the errors in the other congeneric tests. Given three
such tests the reliabilities for each test may be uniquely specified
(Lord and Novick, 1968, equation 9.12.4). As Jöreskog (1968)
notes, equation (2) is a factor analytic model and given at least
four measures the congeneric model may be tested under the assumption
that the observed variables have a multivariate normal
distribution. If the congeneric model provides a reasonable
fit then the true scores could be estimated using estimated factor
scores. When additional assumptions about the tests appear warranted,
the corresponding restrictions on the parameters of the
model may be specified by using Jöreskog's (1969b, 1970a) confirmatory
factor analysis procedures.
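
The statement that three congeneric tests determine the reliabilities can be illustrated numerically. Under equation (2) with mutually independent errors, the covariance of tests k and l is $B_k B_l \sigma_T^2$, so the true-score variance of test k equals $\sigma_{kl}\sigma_{km}/\sigma_{lm}$. The sketch below uses that identity; it is offered as a consequence of the model and has not been checked against the exact form of Lord and Novick's equation 9.12.4. All names and values are hypothetical.

```python
import numpy as np

def congeneric_reliabilities(cov):
    """Reliabilities of three congeneric tests from their covariance matrix."""
    cov = np.asarray(cov, dtype=float)
    reliabilities = np.empty(3)
    for k in range(3):
        l, m = [j for j in range(3) if j != k]
        true_variance = cov[k, l] * cov[k, m] / cov[l, m]
        reliabilities[k] = true_variance / cov[k, k]
    return reliabilities

# Covariance matrix implied by hypothetical slopes and error variances,
# with the true-score variance fixed at 1.
slopes = np.array([1.0, 0.8, 1.2])
error_vars = np.array([0.5, 0.36, 0.56])
cov = np.outer(slopes, slopes) + np.diag(error_vars)

reliabilities = congeneric_reliabilities(cov)
print(reliabilities.round(3))
```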


Group Status 


The information on membership in one of J groups may be coded 
as a set of dummy variables D; (D; = 1 if the person belongs to 
group j and D; — 0 otherwise), a total of J—1 dummy variables 
being required to encode all the information. In the usual procedure; 
one of the groups is designated the reference group, which in OUT 
analysis will be the first group (Le, j = 1). To understand the 


OO 2. = a 


WERTS AND LINN 325 


function of dummy variables in our analysis, consider the prediction 
equation in which a continuous variable is predicted from a set of 
dummy variables: 


J 
У, = А+ 5 В, D; + Ел. 


The regression weights $B_j$ are equal to the difference between
the mean on Y of that group and the mean on Y of group 1, and
A is equal to the mean on Y of group 1. The predictable variance
is the squared correlation ratio, i.e., the ratio of the weighted
variance of the group means to the total variance of Y. In essence,
for prediction purposes, the dummy variable set is equivalent to
computing the group means and using these to predict the
individual's total score, $Y_i$.
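
The equivalence described above can be checked numerically. The following is a minimal sketch with made-up scores for three groups; the function names and data are illustrative assumptions, not part of the original analysis. It fits the dummy-variable regression by least squares and compares the squared multiple correlation with the squared correlation ratio.

```python
import numpy as np


def dummy_regression(y, group, n_groups):
    """Least squares fit of y on J - 1 dummy variables (group 1 is the reference)."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    columns = [np.ones_like(y)]  # intercept A
    for j in range(2, n_groups + 1):
        columns.append((group == j).astype(float))  # dummy variable D_j
    design = np.column_stack(columns)
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ weights
    r_squared = 1.0 - residual @ residual / np.sum((y - y.mean()) ** 2)
    return weights, r_squared


def correlation_ratio_squared(y, group):
    """Weighted variance of the group means divided by the total variance."""
    y = np.asarray(y, dtype=float)
    group = np.asarray(group)
    between = sum(
        np.sum(group == g) * (y[group == g].mean() - y.mean()) ** 2
        for g in np.unique(group)
    )
    return between / np.sum((y - y.mean()) ** 2)


# Hypothetical scores: group means are 4, 6, and 11.
scores = [2, 4, 6, 5, 7, 9, 10, 11, 14]
groups = [1, 1, 1, 2, 2, 3, 3, 3, 3]
weights, r_squared = dummy_regression(scores, groups, n_groups=3)
eta_squared = correlation_ratio_squared(scores, groups)
print(weights, r_squared, eta_squared)
```

In this example the intercept equals the mean of group 1 and the two weights equal the differences of the other group means from it, while the two variance ratios coincide.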

The problem in the present context is that the variable to be
predicted (i.e., estimated) is the true score $T_i$, which has not
been measured directly. This difficulty can be overcome by developing
the correlations of the observed dummy variables with the unmeasured
variable $T_i$, which was defined by equation (2). Estimates of these
correlations may be obtained from Jöreskog's confirmatory factor
analysis model (1969b, 1970a) by defining a set of factors ($D_j'$)
which exactly correspond (i.e., $D_j' = D_j$) to the observed set of
dummy variables. The resulting factor model is:


$$X = \Lambda F + Z \qquad (3)$$

where X is a (k + J − 1) vector of observed test scores and
dummy variables,
F is a J vector of latent common factor scores,
Z is a (k + J − 1) vector of unique scores,
and Λ is the matrix of order (k + J − 1) by J of factor loadings.
In the case of k = 4 congeneric measures and J = 4 groups,

$$X' = (X_{1i}, X_{2i}, X_{3i}, X_{4i}, D_2, D_3, D_4),$$
$$F' = (T_i, D_2', D_3', D_4'),$$
$$Z' = (e_{1i}, e_{2i}, e_{3i}, e_{4i}, 0, 0, 0), \text{ and}$$

$$\Lambda = \begin{bmatrix}
B_1 & 0 & 0 & 0 \\
B_2 & 0 & 0 & 0 \\
B_3 & 0 & 0 & 0 \\
B_4 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{bmatrix} \qquad (4)$$
For this example Φ, the variance-covariance matrix of F, would be:

$$\Phi = \begin{bmatrix}
1 & & & \\
C_{D_2 T} & * & & \\
C_{D_3 T} & * & * & \\
C_{D_4 T} & * & * & *
\end{bmatrix} \text{ (symmetric)} \qquad (5)$$

where * are the known variances and covariances among the
dummy variables, and
$C_{D_j T}$ are the desired covariances of the dummy variables with
the standardized true score (i.e., $\sigma_T^2 = 1$). Dividing by
the standard deviation of that dummy variable yields the
corresponding correlation.


Jöreskog's computer program (1970b) will estimate the "free"
parameters in Λ and Φ, i.e., $B_1$, $B_2$, $B_3$, $B_4$, $C_{D_2 T}$,
$C_{D_3 T}$, and $C_{D_4 T}$. If the input is the correlation matrix
among observed tests and dummy variables, then $B_k^2$ is an estimate
of the reliability of test k and $C_{D_j T}$ divided by
$\sigma_{D_j}$ is an estimate of the correlation of $D_j$ and $T_i$.

The multiple correlation of $T_i$ with the dummy variables is the
"true" or unattenuated eta, i.e., the relative variance of the true
means is identifiable. Given the correlations among all variables, the
standardized regression weights for predicting $T_i$ from the test
scores $X_{ki}$ and the dummy variables $D_j$ may be obtained by means
of standard least squares regression procedures. If relative estimates
of the true means were desired, then only the dummy variables
should be used as independent variables to predict $T_i$, the
regression weights corresponding to the difference between the true
mean for the group indicated by the dummy variable and the true mean
of the reference group. In using Jöreskog's program (1970b) for the
analysis, it is preferable to use the least squares rather than the
maximum likelihood estimating procedure because the latter assumes
that the observed variables have a multivariate normal distribution
(clearly not true for the dummy variables) whereas the former requires
no distributional assumptions.
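
To make the last step concrete, the sketch below assumes that the factor covariance matrix has already been estimated and shows how the unattenuated eta squared and the true mean differences would follow from it by ordinary least squares algebra. All numbers are hypothetical: four equal-sized groups with standardized true scores whose group means are chosen for illustration.

```python
import numpy as np

# Hypothetical population values, chosen only for illustration.
proportions = np.array([0.25, 0.25, 0.25, 0.25])   # group sizes
true_means = np.array([-0.6, -0.2, 0.2, 0.6])      # true group means, overall mean 0

# Known variances and covariances of the J - 1 dummy variables
# (the starred block of the factor covariance matrix).
p = proportions[1:]
dummy_cov = np.diag(p) - np.outer(p, p)

# Covariances of the dummy variables with the standardized true score
# (the first column of the factor covariance matrix); the simple product
# below is valid because the overall true mean is 0.
cov_with_true = p * true_means[1:]

# Regression of the true score on the dummy variables only.
weights = np.linalg.solve(dummy_cov, cov_with_true)

# Squared multiple correlation; no division is needed because Var(T) = 1.
eta_squared = cov_with_true @ weights
print(weights, eta_squared)
```

Under these assumptions the weights reproduce the differences between each group's true mean and that of the reference group, and the squared multiple correlation equals the between-group share of true-score variance.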

In passing it might be noted that the weighted reliability of the 
means for each test can be computed since the reliability of the means 
divided by the test reliability is equal to the squared true eta divided 
by the squared observed eta for that test. If for each of the k tests, 
the reliability of the means equals the reliability of that test, the 
dummy variables (representing the true group means) will not 
contribute to the prediction of the individual true scores.


AN ANALYSIS OF TWO RESPONSE STYLES:
TRUE RESPONDING
AND ITEM ENDORSEMENT


MARTIN E. MORF AND DOUGLAS N. JACKSON
University of Western Ontario 


A true or false response to an item like “I usually help old 
ladies across the street,” is clearly the result of a variety of deter- 
minants. In this case, the respondent may in fact usually help old 
ladies across the street, and be high on a nurturance dimension. 
However, he may also tend (a) to manifest a general tendency to 
respond true to test items, (b) to endorse test items as descriptive 
of himself, and (c) to respond consistently in a desirable or an un- 
desirable direction. In other words, his response may be deter- 
mined by the substance or content of the item, as well as by aspects 
of its form, such as ambiguity and salience, positive as opposed 
to negative wording, and desirability scale value. The effects of
such aspects of item form, interacting with subject characteristics
and manifesting themselves as response styles (Jackson and Messick,
1958, 1962), were the primary focus of the present investigation.

A major aim of this study was to distinguish clearly between two
interpretations of acquiescence response style. Initially,
acquiescence was defined as a response set which manifests itself
primarily in the tendency to respond true to test items (Lentz, 1938;
Cronbach, 1942). In these earliest papers, little attention was
directed at these tendencies as a characteristic of respondents; the
emphasis was primarily upon tests and their propensity to elicit such
sets. Cronbach (1946; 1950) assembled data bearing on response 
sets, both as properties of tests or test items and as variables on 
which subjects differ reliably which might attenuate the logical 
validity of a test score. Jackson and Messick (1958) in reviewing 
a number of studies sought to highlight the hypothesized persono- 
logical basis for these tendencies by designating them as response 
styles, which might have predictive value for certain criteria, but 
which are logically distinct from dimensions of content. Further, 
they hypothesized that the major dimensions of a number of per- 
sonality inventories, such as the MMPI, are interpretable pri- 
marily in terms of style rather than content. In a series of factor 
analytic investigations performed on three different samples,
Jackson and Messick (1961, 1962) identified two large dimensions
on the MMPI accounting for more than 75 per cent of the
common variance. One of these was highly associated with the
desirability scale value of the items, and the other completely sep- 
arated true and false-keyed scales. Jackson and Messick identi- 
fied this second factor as acquiescence. This interpretation was 
challenged by a number of authors, including Block (1965), 
Rorer (1965), and Rorer and Goldberg (1965a, 1965b). Block 
raised a number of points which have been discussed elsewhere
(Bentler, 1966; Jackson, 1967c, 1967d; Block, 1967), but the
major issue raised by Rorer and Goldberg was that a true re- 
sponse to an MMPI item in its original form was usually associ- 
ated with a false response to that same item in reversed (usually 
negative) form. Rorer and Goldberg attributed this apparent
consistency in responding to MMPI items to reliable responding to
item content. However, Jackson and Messick (1965) demonstrated 
that the same very general factors were operating in the Rorer 
and Goldberg findings as had been identified previously, and sug- 
gested that these data might be amenable to a stylistic interpre- 
tation. 

The present study pursued the question of stylistic responding in 
the context of original and reversed, positively and negatively 
worded items in a study carefully designed to avoid some of the 
problems inherent in the use of the MMPI. By using carefully
selected items eliciting hypothesized content and stylistic processes,
especially constructed response style marker scales, and a new
analytical rotation procedure, the investigators hoped to be able to 
throw light on a still thorny and controversial question. 

Because the separation of the distinct acquiescence processes of 
true responding and item endorsement is critical in the interpreta- 
tion of the results of the present investigation, it is important to 
distinguish these two processes explicitly. True responding refers to 
the differential tendency to respond true to an item regardless of 
the direction of wording. Persons high on this tendency would be ex- 
pected to respond true both to an original item and a logical re- 
versal. Item endorsement, on the other hand, refers to the tendency 
to ascribe characteristics to oneself regardless of the direction of 
keying. Thus, an individual high on item endorsement would be 
expected to answer true to a positively stated item, but to respond 
false to a negation of that same item. The examples contained in 
Tables 1 and 2 are useful in distinguishing these response styles. 

Consider a family of statements drawn from the PRF Autonomy
scale, which in its positively worded true-keyed version takes the 
following form: “I usually try to solve problems by myself.” This 
is the sort of question that a person high on true responding would 
endorse by responding true, and if he did this consistently over a 
number of heterogeneous items he would obtain a high score on a 
true responding dimension. Note, however, that this person, if he 
is high on a true responding dimension, would also endorse a posi- 
tively worded reversal of this same item, now keyed false for an 
Autonomy scale. Note also that a person high on the true respond- 
ing dimension would tend to respond true to negatively worded 
items regardless of whether or not they are keyed true or false. The 
individual low on the true responding dimension would, of course, 
show opposite tendencies. 

In terms of correlational patterns, true-keyed scales measuring
different content areas should be more highly positively correlated
in the presence of true responding than in its absence. Similarly,
false-keyed scales should be more highly positively correlated
than would be expected on the basis of content. On the other hand,
true-keyed and false-keyed scales would tend to be more highly
and negatively correlated than expected on the basis of content.
This would lead to an hypothesized factor, distinct from other
response styles and from content, on which ideally all true-keyed
scales would load in the same direction and false-keyed scales
would load in the opposite direction.

This explanation of how the hypothesized tendency to respond 
true should manifest itself is rather complex and Table 1 consti- 
tutes an attempt to present its salient features. 

Consider now the same family of statements drawn from the 
PRF Autonomy scale in relation to the hypothesized tendency to 
endorse the characteristics referred to by items. Positively worded 
items are true keyed for item endorsement since a true response re- 
flects affirmation of a trait, characteristic, or behavioural tendency. 
A person high on item endorsement should respond true to posi- 
tively worded items and should respond inconsistently to a posi- 
tively worded item and its positively worded logical reversal. 
Similarly, a subject high on item endorsement should respond false 
to negatively worded items regardless of logical consistency. Since 
these items contain a negation, a second negation, reflected in a 
false response, is required to attribute to oneself the trait referred 
to by them. It is these expected false responses to negatively 
worded items that differentiate item endorsement from true re- 
sponding. 

As in the case of true responding, it can be shown that item en- 
dorsement should manifest itself in a specific pattern of correlations 
between positively worded and true-keyed, positively worded and 
false-keyed, negatively worded and true-keyed, and negatively 
worded and false-keyed scales. Positive correlations would be ex- 
pected between positively worded and true-keyed and negatively 
worded and false-keyed scales and between positively worded false- 
keyed and negatively worded true-keyed scales, while negative 
correlations would be expected between the remaining four possi- 
ble pairs of scales. Item endorsement should thus emerge as a fac- 
tor on which positively worded true-keyed scales and negatively 
worded false-keyed scales obtain loadings in one direction and on
which positively worded false-keyed and negatively worded true- 
keyed scales obtain loadings in the opposite direction. 
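
The two hypothesized patterns can be written out mechanically. The sketch below is an illustrative encoding only (the function names and the +1/−1 coding are assumptions, not the authors' procedure): it derives, for each of the four scale types, the direction in which each response style should push the scale score, and from these the expected sign of the style-induced correlation between every pair of scale types.

```python
from itertools import combinations

# The four scale types: (direction of wording, direction of keying).
SCALE_TYPES = {
    "PT": ("positive", "true"),
    "PF": ("positive", "false"),
    "NT": ("negative", "true"),
    "NF": ("negative", "false"),
}


def true_responding_sign(wording, keying):
    """+1 if answering 'true' raises the scale score, else -1."""
    return 1 if keying == "true" else -1


def item_endorsement_sign(wording, keying):
    """+1 if ascribing the trait to oneself raises the scale score, else -1.

    Endorsement means answering 'true' to a positively worded item and
    'false' to a negatively worded one.
    """
    endorsing_response = "true" if wording == "positive" else "false"
    return 1 if keying == endorsing_response else -1


def expected_correlation_signs(sign_function):
    """Sign of the style-induced correlation for each pair of scale types."""
    signs = {name: sign_function(*facets) for name, facets in SCALE_TYPES.items()}
    return {(a, b): signs[a] * signs[b] for a, b in combinations(SCALE_TYPES, 2)}


true_responding_pairs = expected_correlation_signs(true_responding_sign)
item_endorsement_pairs = expected_correlation_signs(item_endorsement_sign)
for pair in true_responding_pairs:
    print(pair, true_responding_pairs[pair], item_endorsement_pairs[pair])
```

For item endorsement the coding gives positive signs only for the PT with NF and PF with NT pairs and negative signs for the remaining four pairs, in line with the pattern described above; for true responding it groups the scales by keying alone.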

Here again, the explanation of how the hypothesized tendency to 
endorse items should manifest itself is complex and Table 2
constitutes an attempt to present its salient features more clearly.

The present study examined the general hypothesis that there
are two distinct response sets subsumed by the term acquiescence
which are related to certain formal and connotative properties of
items, and which can be distinguished from dimensions of content,
as well as from two other stylistic dimensions. The two other
stylistic dimensions are the well documented but nevertheless
controversial desirability factor, as elicited by items explicitly
selected to be extreme in terms of judged desirability, and a
dimension reflecting a differential tendency to endorse many or few
adjectives as descriptive of oneself (cf. Jackson and Lay, 1968;
Bentler, 1969).

Because of the important distinction between attitude and per- 
sonality items, both in terms of their formal properties and in 
terms of the hypothesized “species” of acquiescence which they 
tend to elicit (Jackson, 1967b; Messick, 1967), it was considered 
essential to incorporate within the design a systematic variation 
of self-descriptive personality versus attitude item format. Thus, 
in the tradition of Guttman (1958), a facet design was employed 
in which dimensions of content, dimensions of response style, and 
formal properties of items are systematically varied so as to elu- 
cidate their mutual interrelations and the conditions under which 
Certain processes will emerge. This kind of design, allowing con- 
tent to appear under a variety of types of item format, direction 
of wording and of keying, while at the same time allowing response 
style to appear in the context of a variety of content dimensions, 
Was used to avoid the sometimes fruitless controversy that sur- 
Tounds the use of an item pool such as that of the MMPI. In the 
Case of the MMPI, none of these properties can safely be assumed 
to be independent, systematically varying, or, in the case of ho- 
Mogeneous dimensions of personality scale content, even unequi- 
vocally identifiable. In addition, the study was designed to avoid 
certain other problems attending earlier research, such as the 
Correlated error present in item overlap between scales. Finally, 
the facet design used was expected to permit a clearcut empiri- 
al distinction between content and style. 

The facet design led to a general prediction that particular
content dimensions and particular stylistic dimensions would emerge
from the analysis. In particular, with regard to the nature of the
hypothesized true responding and item endorsement acquiescence
processes, the following three predictions were made:

1. Factor analysis of a battery of positively worded true-keyed,
positively worded false-keyed, negatively worded true-keyed, and
negatively worded false-keyed scales yields a true responding
factor separating true- from false-keyed scales and an item
endorsement factor separating acceptance- from rejection-keyed scales.

2. Scales consisting of external referent attitude items elicit 
true responding more strongly than item endorsement, i.e., they 
obtain higher loadings on the predicted true responding factor than 
on the predicted item endorsement factor. 

3. Scales consisting of self-descriptive personality items elicit 
item endorsement more strongly than true responding, i.e., they ob- 
tain higher loadings on the predicted item endorsement factor than 
on the predicted true responding factor. 


Method 
Subjects 


A total of 196 volunteers, 87 males and 109 females, all
university students, were tested. Their ages ranged from 16 to 34
years; the median age was 18.


Experimental Measures 


A two-part questionnaire, consisting of 560 true-false person- 
ality and attitude items, and an adjective checklist were used to 
assess 51 variables. The adjective checklist consisted of 45 de- 
sirable, 90 neutral, and 45 undesirable adjectives selected from 
Anderson’s (1968) list. They were arranged in random order. 
Separate scores, the number of desirable, neutral, and undesir- 
able adjectives checked as self-descriptive, were obtained. 

Personality and attitude scales. The 32 experimental scales con- 
tained in the questionnaire consisted of four groups of eight scales. 
Each group represented a different content area. The content areas 
included were Exhibition, Play, Succorance, and Understanding. 
These particular content areas were selected because they repre- 
sented a range of different kinds of content, and because sizeable 
numbers of suitable items were available from the original Per- 
sonality Research Form (Jackson, 1967a) item pool, from which 
items meeting specific criteria could be chosen. Items were care- 
fully selected on the basis of their intermediate endorsement fre- 
quencies, low correlations with desirability, and moderate content 
saturations as reflected in moderate (yet definite) biserial
correlations with their respective total scales. All three of these
characteristics have been shown to maximize the acquiescence-eliciting
potential of items (Hanley, 1962; Jackson and Lay, 1968; Jackson 
and Messick, 1961; Trott and Jackson, 1967; Wiggins, 1962). 
For each content area, some items selected were modified so as to 
provide an initial set of four 6-item scales. Each scale was de- 
signed to fit one of the following categories of items: positively 
worded true-keyed; positively worded false-keyed; negatively 
worded true-keyed; and negatively worded false-keyed. The
modifications involved changing the items from, for example,
positively worded to negatively worded, such that there were a
sufficient number to round out the total facet design. It was necessary
to translate some of the items because not enough of the various 
combinations were available from the original item pool. Sim- 
ilarly, it was necessary to duplicate 12 items by arranging the 
same basic stem in different forms so that sufficient items of each 
type were available for experimental use. Because none of the 
items chosen for the present study were high in content satura- 
tion, there is no overlap between them and those contained in the 
final forms of the PRF. 

Experimental attitude scales, parallel to the 16 self-descriptive
scales, were obtained by literal translation of the self-descriptive
personality items described above into attitude items. For the
purposes of this study, attitude items were defined as items eliciting
an evaluative response toward some external referent. In most
cases a self-descriptive item permitted identification of an
external referent which could serve as the subject of an attitude
statement. For example, the direct object of the positively worded,
true-keyed Exhibition item “I would like to have a flashy car that
would make others stop and look as I drove by” made a suitable
subject for a translation into attitude item form: “A flashy car
that makes people stop and look is well worth paying a lot of
money for.”

Questionnaire marker scales. Four heterogeneous acquiescence
scales were constructed by selecting items from the MMPI with
neutral desirability scale values (Messick and Jackson, 1961) and
intermediate endorsement frequencies (Wiggins, 1964). Sixty per
cent of these items were used in their original form. The remaining
items were used in their reversed and negatively worded form.
Where available, the reversals of Lichtenstein and Bryan (1965)
were used; where not, new reversals were written. This procedure
yielded a set of 60 positively worded, and a set of 60 negatively
worded items. Each set was randomly divided into two 30-item
scales, one keyed true, the other keyed false.

Four additional 36-item acquiescence marker scales were
constructed by modifying items from the original large PRF item
pool. Four items for each of 18 PRF traits were selected. Depending
on their original form, these were translated into either negatively
worded or positively worded reversals, in the same manner
as were the self-descriptive items for the experimental scales.
Four positively and four negatively worded items were thus
available for each trait. This permitted balancing content both
by representing several traits on each acquiescence scale and by 
balancing the direction of keying within traits. Thus, two posi- 
tively worded and two negatively worded acquiescence scales 
balanced for content were constructed. Each scale consisted of 18 
pairs of items, one pair for each trait. The items of each pair mea- 
sured the same trait, but in different directions, one true-keyed for 
content, the other false-keyed. Four scales resulted, two scales con- 
taining positively worded items, one arbitrarily keyed true, the 
other false, and two scales containing negatively worded items, 
again, one keyed true, and one false. 

Since acquiescence has frequently been studied in the context 
of items written in the style of the California F scale, acquiescence 
marker scales comprised of such items were included. The Clayton 
and Jackson (1961) absolutely worded and relatively worded true 
and false-keyed attitude scales based on the California F scale
were selected for this purpose.

Marker scales for desirable responding, in the form of the
desirability scales from Form A and Form B of the PRF, were also
included. Items for these scales were chosen by virtue of their
higher correlation with a preliminary desirability scale than with
preliminary PRF content scales (Jackson, 1967a). These two
scales have been shown to define clearly a desirability factor also
loaded by desirable adjectives (Jackson and Lay, 1968). In addition,
unlike many other desirability scales, these scales consist of
items drawn from domains other than psychopathology, and are
thus free from this source of confounding content variance.

Assignment of items to separate testing sessions. Because
self-descriptive personality items and attitude items were based on
the same set of item stems, separate testing sessions and separate
test booklets, designated Form A and Form B, were planned to avoid
the possibility of a subject comparing his responses to the two 
types of items. Except for the desirability scales, half of the items 
for each scale were assigned to Form A, the other half to Form B. 
One desirability scale was assigned to each form. A self-descrip- 
tive item and its attitude translation never appeared in the same 
form. 


Procedure 


Subjects were tested in four groups. Each group appeared for 
two 90-minute sessions separated by a one week interval. During 
the first session, Form A was administered, followed by a filler 
task; during the second session, subjects completed Form B, the 
adjective checklist, and a post experimental questionnaire. 


Results 


The final length of 10 of the experimental scales and two of the 
PRF acquiescence marker scales differed slightly from the initial 
length due to re-keying of nine items after the data had been col- 
lected. In some cases, the re-keying was necessitated by an initial 
error, in some by the adoption of more refined criteria for deter- 
mining whether an item was positively or negatively worded. 


Scale Reliabilities 


Table 3 lists the variables, together with, in the case of the
questionnaire scales, their Kuder-Richardson formula 20
reliabilities. The median reliability for the 32 short experimental
scales was .32. Projecting these reliabilities for comparison purposes
to a scale of more usual length, a 30-item scale, yields a median of
.70. Considering that these items were chosen from items moderate,
rather than high, in content saturation, the reliabilities appear to
be adequate.
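
The text does not name the projection formula. Assuming it is the Spearman-Brown prophecy formula, and that the short experimental scales had six items each as described under Experimental Measures, the projected median can be reproduced as follows; the function name is illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale is lengthened by `length_factor`."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# A 6-item scale with reliability .32 projected to 30 items (five times as long).
projected = spearman_brown(0.32, 30 / 6)
print(round(projected, 2))
```

Under these assumptions the projection agrees with the reported median of .70.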


Factor Analytical Procedures 


The correlation matrix of the 51 variables was subjected to a
principal axes factor analysis. Bentler's (1971) Clustran procedure
was used to rotate the factor matrix. This method involved
the use of an hypothesis matrix to specify the expected direction
of loading for each variable on each of the factors. The eight
predicted factors comprised four stylistic dimensions: true
responding, item endorsement, desirability, adjective endorsement; and
four content dimensions: Exhibition, Play, Succorance, and
Understanding. Adjective endorsement was expected to be a separate
factor on the basis of evidence obtained by Jackson and Lay
(1968). Since eight factors were predicted, the first eight principal
axes factors, accounting for 90 per cent of the common variance
as estimated by the sum of the original communalities, were
retained for rotation. The application of the Clustran procedure
to the factor matrix yielded an oblique factor structure and a
factor pattern fitting the hypothesis matrix best in the least squares
sense. Using the factor pattern and a matrix obtained from the
factor correlations, an orthogonal factor matrix was computed. It is
reported here as the final solution in Table 4. This solution does
not differ markedly from the oblique factor structure, nor, indeed,
from an independently performed graphical rotation of the principal
axes factor matrix. The Clustran orthogonal solution has the
advantage of permitting the estimation of the contribution of the
factors to the variance of each variable and vice versa. At the
same time, the availability of an oblique factor structure permits
the identification of second order factors, if these are of
theoretical interest.


Stylistic Factors 


In this section each of the stylistic factors is presented by means
of a listing of the salient positive and negative loadings and a
brief description. The abbreviations PT and PF stand for positively
worded true-keyed and positively worded false-keyed scales,
respectively, and NT and NF stand for negatively worded true-keyed
and negatively worded false-keyed scales, respectively.


Factor I 
All true-keyed scales loaded positively. 
All false-keyed scales loaded negatively. 


Rather than list all of the scales loading this factor, it is
sufficient to note that it separates true- from false-keyed scales
without exception, and that this separation occurred regardless of
positive or negative wording. This factor clearly reflects true
responding. The three highest positive loadings were obtained by the
Clayton and Jackson relatively and absolutely worded true-keyed F
scales, and by one of the true-keyed acquiescence marker scales,
while the three highest negative loadings were obtained by a
false-keyed absolutely worded F scale and by false-keyed
Understanding and Succorance items in an attitude format.


TABLE 4
Rotated Factor Matrix

                              Factor
Variable     I    II   III    IV     V    VI   VII  VIII    h²
ESPT        23    23   -16   -06    55    10    04   -10    46
ESPF       -13   -17    15   -21     ?   -02    11   -15    44
ESNT        12   -08    18   -04    57    08    14     ?    40
ESNF       -25    21    08    06    47    28     ?    09    43
EAPT        19    10   -18   -11    41    12   -10   -09    29
EAPF       -33   -11    04    05    40    05    26    07    36
EANT        34   -08   -06   -01    25   -06    00   -01    19
EANF       -25    09    08    10    31    07     ?    11    20
PSPT         ?     ?     ?     ?     ?     ?    13    03    35
PSPF       -25   -09   -02   -06    12    60    08   -20    50
PSNT        33   -01   -05    18    09    63    01   -08    54
PSNF       -26    26   -11    02    17    42    07    00    36
PAPT        21     ?     ?     ?     ?     ?    13   -08    34
PAPF       -33   -15   -16    02    02    43   -06   -09    35
PANT        29   -11   -12    10    00    57   -03   -09    45
PANF         ?     ?     ?     ?     ?    45    18    02    37
SSPT        04    16   -27     ?    08    15    55   -17    47
SSPF       -31   -09    06    06    14   -07    52    01    41
SSNT        11   -04    05    06    17    04    55   -07    36
SSNF       -08    26   -22    00    04    18    51   -06    42
SAPT        31    17   -12    13    09    04    39   -13    33
SAPF       -40   -15    18    02     ?   -13    48    09    51
SANT        15   -11    10   -02   -06    15    44    07    27
SANF       -20    04   -08   -04   -05    09    49   -08    31
USPT        12    31    02    09    11   -09   -07    54    44
USPF       -22    05   -06   -15    01    04   -02    68    54
USNT        12   -22    00    09    01   -18   -09    45    31
USNF       -33    20    10   -06    06    05   -19    46    42
UAPT        30    13    18   -07   -14   -25   -07    39    37
UAPF       -30   -12   -08   -11   -09   -01    15    66    59
UANT        20   -18   -05   -06   -11    07   -04    17    13
UANF       -40    11    18   -22   -04   -10    00    46    48
HMPT        34    53   -24    20    13    00    08    05    52
HMPF       -30   -52    49   -19   -01   -04   -05   -09    65
HMNT        33   -36    41   -07    06   -08   -22    00    47
HMNF       -31    50   -24    05    08    10    23   -06    48
HPPT         ?    40     ?     ?     ?     ?   -04    09    43
HPPF       -38   -24   -03   -08   -12   -14   -10   -28    30
HPNT        28   -25   -30   -11   -07    06   -13   -32    37
HPNF         ?    38     ?     ?     ?     ?     ?     ?    33
FRT         42    27    05   -18   -22    05    08   -38    49
FRF        -37    10   -09   -14   -03    12    06   -21    24
FAT         49    26    10   -23   -22    00    09   -35    55
FAF        -39    17   -04   -28   -06    03    16   -10    30
DA         -10   -10    65    08    05   -11   -05    11    48
DB         -02     ?    68   -15   -01   -05    09    02    51
AD          05    11    56    48    01    08   -09   -07    53
AN          10    26     ?     ?     ?     ?    11   -05    60
AU          05    21   -51    47   -01    06    13   -09     ?
INFR       -16   -09     ?    10    18    04   -14   -01    11
SEX        -05    06   -13   -03   -24    09    34    13    22

Sum of squares
          3.74  2.48  2.70  1.54  2.15  2.52  2.58  2.80
Percentage of common variance
          18.2  12.1  13.2   7.5  10.5  12.3  12.6  13.6

Note: Decimals omitted. This matrix is the Clustran matrix providing
the least squares orthogonal fit to the hypothesized factors.
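
The summary rows of Table 4 can be checked with a few lines of arithmetic. The sketch below assumes, as is usual for an orthogonal solution, that each percentage is a factor's sum of squared loadings divided by the total over the eight factors, and that a variable's communality is the sum of its squared loadings across factors. The function names are illustrative; the numbers are those printed in the table.

```python
# Sums of squared loadings for Factors I to VIII, as printed in Table 4.
sums_of_squares = [3.74, 2.48, 2.70, 1.54, 2.15, 2.52, 2.58, 2.80]
reported_percentages = [18.2, 12.1, 13.2, 7.5, 10.5, 12.3, 12.6, 13.6]


def percent_common_variance(sums):
    """Each factor's share of the total sum of squared loadings, in per cent."""
    total = sum(sums)
    return [100.0 * s / total for s in sums]


def communality(loadings):
    """Sum of squared loadings of one variable across orthogonal factors."""
    return sum(value * value for value in loadings)


percentages = percent_common_variance(sums_of_squares)

# Loadings of the first variable in Table 4, with decimals restored.
first_row = [0.23, 0.23, -0.16, -0.06, 0.55, 0.10, 0.04, -0.10]
first_row_communality = communality(first_row)
print([round(p, 1) for p in percentages], round(first_row_communality, 2))
```

The computed percentages agree with the printed ones to within rounding, and the communality of the first row rounds to the tabled value of .46.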


Factor II 

MMPI acquiescence PT                      53
MMPI acquiescence NF                      50
PRF acquiescence PT                       40
PRF acquiescence NF                       38
Understanding (Self-descriptive) PT       31
MMPI acquiescence NT                     -36
MMPI acquiescence PF                     -52


Positive loadings on Factor II are of two types: those which
reflect the endorsement of a positively worded item and those
which reflect the denial of a negatively worded item. Logically,
these are equivalent.

The salient loadings on Factor II are almost exclusively from
the acquiescence marker scales, with positively worded true-keyed
and negatively worded false-keyed items loading positively, and
negatively worded true-keyed and positively worded false-keyed
items loading negatively. Since the denial of a negatively worded
false-keyed item is equivalent to affirming a positively worded
true-keyed item, positive loadings on this factor are clearly
attributable to a general tendency to accept the traits implied by
personality items as self-descriptive. This factor is therefore
identified as one reflecting item endorsement. It is especially notable
that of the 44 personality and attitude scales in which direction
of wording and direction of keying are systematically varied, all
but one loaded in the appropriate direction. A biserial correlation
coefficient between the magnitude of the factor loading and the
expected direction of loading was computed; its value was .99.
The one exception to this trend was a negligible .05 loading for
the positively worded false-keyed Understanding self-descriptive
scale. This very high degree of separation between a wide variety
of content scales on the basis of their acceptance keying and
rejection keying offers very clear support for the presence of an
item endorsement process as distinguished from a true responding
process. It illustrates that it is possible to distinguish clearly
between these two processes. By unconfounding the direction of
keying from the direction of wording, it was possible to identify a
number of both true-keyed and false-keyed scales loading positively
or negatively on this dimension.

Figure 1 presents the factor plot of Factors I and II. The scales,
with the negligible exception referred to in the discussion of
Factor II, group themselves as predicted. The chi square computed
on the number of relevant scales in each quadrant is significant at
the .01 level for each of the four wording and keying scale variations.
The results also support the prediction that attitude items elicit
true responding more strongly than item endorsement. Of the 16
experimental attitude scales, 15 obtained higher loadings on the
true responding factor than on the item endorsement factor
(chi square = 12.25, df = 1, p < .001); the loadings of the various
scales derived from F scale items range from .37 to .49 on true
responding but only from .10 to .27 on item endorsement. The prediction
that personality items elicit item endorsement more strongly than
true responding is not supported by these results.
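
The chi square reported for the attitude scales can be checked by hand. The sketch below assumes a goodness-of-fit test of the 15-to-1 split against an even split, with no continuity correction; the source does not describe the computation, and the function name is illustrative only.

```python
def chi_square_equal_split(observed):
    """Pearson chi square against equal expected counts (assumed null hypothesis)."""
    total = sum(observed)
    expected = total / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


# 15 of the 16 attitude scales loaded higher on true responding; 1 did not.
chi2 = chi_square_equal_split([15, 1])
print(chi2)  # 12.25
```

Under these assumptions the statistic is (15 - 8)^2/8 + (1 - 8)^2/8 = 12.25 with one degree of freedom, which agrees with the value given in the text.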


Factor III

Desirability, PRF Form B 68
Desirability, PRF Form A 65
Desirable adjectives 56
MMPI acquiescence PF 49
MMPI acquiescence NT 41
Undesirable adjectives -51


Clearly, this is a desirability factor, with the highest positive 
loadings obtained for the two parallel PRF desirability scales and 
highest negative loading for the tendency to endorse undesirable 
adjectives. It is especially noteworthy that the highest loadings 
on this factor were obtained from item and adjective scales 
which were not predominantly comprised of pathological content. 
Such a finding thus would tend to disconfirm Block’s (1965) 
contention that desirability responding can be explained in terms 
of consistent responding to pathological content. Block based his 


Figure 1. Factor plot of True Responding and Item Endorsement factors.
Note that all true keyed scales have positive loadings on Factor I, and that
Factor II separates scales reflecting differential tendencies to endorse
characteristics as self-descriptive.
analysis upon research with the MMPI, in which pathological
content is usually present in the undesirable direction. When path- 
ological content and desirability are separated, there is neverthe- 
less a clearly defined desirability factor present. The loadings for 
the MMPI acquiescence marker scales on this factor are probably 
explainable in terms of the fact that although these scales were 
selected from the middle range of rated desirability on the 
MMPI, they tend more in the undesirable than the desirable di- 
rection because of the predominantly undesirable content of the 
majority of MMPI items. It is noteworthy that the acquiescence 
marker scales from the PRF item pool tended to obtain low load- 
ings on this factor. 


Factor IV 

Neutral adjectives 68 
Undesirable adjectives 47 
Desirable adjectives 43 


This factor, labeled adjective endorsement, is similar to the 
factor obtained by Jackson and Lay (1968) and supports their 
conclusion that adjective endorsement can be distinguished from 
other types of response style. 


Content Factors 


For each of the four content factors below, defining scales are
listed together with the variable with the highest absolute
irrelevant loading.


Factor V 
Exhibition (self-descriptive) NT 57 
Exhibition (self-descriptive) PT 55 
Exhibition (self-descriptive) PF 54 
Exhibition (self-descriptive) NF 47 
Exhibition (attitude) PT 41 
Exhibition (attitude) PF 40 
Exhibition (attitude) NF 31 
... Sex -24

Factor VI
Play (self-descriptive) NT 63
Play (self-descriptive) PF 60
Play (attitude) NT
Play (attitude) PT
Play (attitude) NF
Play (attitude) PF
Play (self-descriptive) NF
Play (self-descriptive) PT
... Exhibition (self-descriptive) NF


Factor VII 
Succorance (self-descriptive) PT 
Succorance (self-descriptive) NT 
Succorance (self-descriptive) PF 
Succorance (self-descriptive) NF 
Succorance (attitude) NF 
Succorance (attitude) PF 
Succorance (attitude) NT 
Succorance (attitude) PT 

... Sex


Factor VIII 

Understanding (self-descriptive) PF 
Understanding (attitude) PF 
Understanding (self-descriptive) PT 
Understanding (self-descriptive) NF 
Understanding (attitude) NF 
Understanding (self-descriptive) NT
Understanding (attitude) PT

... Relative F scale true-keyed


39 
—38 
These four factors are clearly associated with item content,
specifically Exhibition, Play, Succorance, and Understanding,
regardless of the direction of keying or the direction of wording.
Even though only relatively short scales comprised of items
moderate in content saturation were employed, content was nevertheless
strong enough to emerge to define distinct factors. Of particular
interest is Factor VIII, in which Understanding scales define
one pole of the dimension and attitude scales in the style of the
California F scale define the other pole of the dimension. This
finding clearly supports the view that F scale attitudinal content
is substantially and negatively associated with a set of personality
characteristics reflecting intellectual curiosity. It is also
consistent with the frequently noted tendency of high scorers on the
F scale to manifest relatively lower levels of education,
intelligence, and verbal skills. While all four versions of the F scale
loaded negatively on this dimension, the highest negative loadings
were obtained for the true-keyed, positively worded forms,
suggesting that these items, either because of their particular
wording or because of some interaction between their positive
form and the nature of the content, tend to reflect more clearly the
opposite pole of a dimension associated with intellectual curiosity.


Analysis of Relationships between Factors 


A second order factor analysis was performed to obtain further 
insight into the processes underlying the data. The oblique
transformation matrix was pre-multiplied by its transpose, yielding a
factor correlation matrix (see footnote 3). These correlations
between factors were in turn subjected to a principal components
factor analysis and the factor loading matrix of the four largest 
principal components was rotated to a Varimax criterion. An ex- 
amination of the rotated second order factor loadings contained 
in Table 5 permits a number of conclusions. First of all, even at 
the second order level of analysis, the two acquiescence factors, 
true responding and item endorsement, appear to be independent. 
This supports the contention that these two processes are both 
conceptually and empirically distinct. Adjective endorsement, 
on the other hand, is strongly associated with item endorsement 


TABLE 5
Rotated Second-Order Factor Matrix

Second-Order Factor

First-Order Factor I II III IV
I True Responding 24 -26 -05
II Item Endorsement 90 12 Lm 14
III Desirability -16 -02 -09 -92
IV Adjective Endorsement 84 13 06
V Exhibition -05 76 1% -19
VI Play 29 69 15 00
VII Succorance 04 68 -09 86
VIII Understanding 10 -34 -80 -15


Note: Decimals were omitted. Salient loadings are in boldface.
supporting the hypothesis that there is a differential willingness
to ascribe personality characteristics to oneself. Since both the 
item endorsement and the adjective endorsement factors are prom- 
inently represented on the first of the second order dimensions, it 
appeared appropriate to label this dimension trait endorsement. 
The second dimension of the second order analysis is clearly asso- 
ciated with scale content. Because the primary factors clearly 
distinguished appropriate scales, it is probably more appropriate 
to focus on the primary factors rather than on this second order 
factor in describing the scale content. The third second order 
factor again provided insight into the processes underlying stylis- 
tic responding. On this factor, the primary factor reflecting true 
responding is opposed to the primary factor prominently associ- 
ated with Understanding and endorsement of non-authoritarian 
content. In a number of earlier studies (Martin, 1964; Messick, 
1967), acquiescence was associated with intellectual ability. The 
results obtained here support this interpretation and clarify the 
personality disposition associated with true responding. The
tendency to respond true indiscriminately, particularly to attitude
items, appears to be negatively correlated with a dimension of 
intellectual curiosity. The fourth second order factor, desirability, 
has been well documented in the past and needs little further 
elaboration here. Had an oblique orientation of second order ref- 
erence factors been permitted, this dimension would have been 
associated with the trait endorsement dimension in such a way 
that individuals who respond desirably would tend to obtain 
lower scores on trait endorsement. 
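
The first step of the second order analysis, forming factor correlations as the product of the transformation matrix and its transpose, can be illustrated with a small sketch. It assumes that the columns of the transformation matrix are of unit length, so that the product has ones on its diagonal; the two-factor matrix and the helper names are hypothetical and are not the authors' data or program.

```python
import math


def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def factor_correlations(transformation):
    """Pre-multiply the transformation matrix by its transpose (T'T)."""
    return matmul(transpose(transformation), transformation)


# Hypothetical transformation: two unit-length columns 60 degrees apart.
angle = math.radians(60)
T = [[1.0, math.cos(angle)],
     [0.0, math.sin(angle)]]

phi = factor_correlations(T)
for row in phi:
    print([round(value, 2) for value in row])
```

In this illustration the off-diagonal entry equals the cosine of the angle between the two columns, that is, a factor correlation of .50. In the analysis described above, the resulting matrix would then be submitted to principal components and the retained components rotated to the Varimax criterion.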

Discussion

The results indicate clearly that what has previously been la- 
beled acquiescence refers to two quite independent response styles 
—true responding and item endorsement. These response styles 
appear to involve quite different underlying processes, to produce 
different patterns of correlates, and to be independent. In general, 
past attempts to measure generalized “acquiescence” have resulted
in measures confounding these two distinct response processes. 
They can emerge clearly only when properly balanced combina- 
tions of positively and negatively worded items and of true and 
false keying are employed. Criticism directed at the response style 
hypothesis based on low and nonsignificant correlations between
putative measures of acquiescence consisting of items exhibiting 
different format properties, might better have been focused on the 
problem of uncovering the discrete processes elicited by these 
different item formats. 

In addition to identifying them as distinct factors, the present 
study also throws light on the properties and correlates of true 
responding and item endorsement, both in terms of the item prop- 
erties that elicited them and in terms of hypothesized underlying 
processes. One important finding is that attitude items, particu- 
larly items written in the style of the California F Scale, showed 
marked tendencies to elicit true responding, while personality 
items elicited item endorsement and true responding to about 
equal degrees. Furthermore, the results, particularly those
derived from the second order factor analysis, suggest that an
intellectual component underlies true responding. Specifically, true
responding proved to be related to intellectual curiosity as re- 
flected in Understanding items from the PRF item pool and to 
rejection of authoritarian content. This is consistent with and en- 
larges on earlier findings by Martin (1964) and by Messick (1967) 
linking agreement with both original and reversed attitude items to 
low intellectual ability, low education, and lack of verbal ability. 
Messick’s (1967) analysis, in particular, based on a large number of 
factor analytic studies, is quite consistent with the present result. 
Messick suggested that acquiescence to attitude items (true
responding in the present context) has an intellectual basis and
that subjects who exhibit double agreements have some difficulty 
in processing the verbal content and ideas contained in the items, 
while acquiescence to personality items is not related to a lack of 
verbal skills, but to the subjects’ willingness to impulsively ac- 
cept or endorse the content as self-descriptive. The present find- 
ings serve to support this hypothesis and to explicate further 
the distinction between these processes. In the context of recent 
studies on the relationship between acquiescence elicited by at- 
titude items and conformity (Quinn, 1963; Stricker, Messick, 
and Jackson, 1968), the results reported here suggest that the con- 
formity manifested by individuals high on true responding may 
be mediated by relatively low levels of ability and competence 
(cf. League and Jackson, 1964). The present findings also imply 
that true responding rather than item endorsement is the process 
accounting for the logically inconsistent responding demonstrated
by Peabody (1961) on attitude items. 

While attitude items elicited primarily true responding, person- 
ality items elicited true responding and item endorsement to 
about equal degrees. Items from the neutral range of desirability 
were used in this study while most studies based on the MMPI 
involved items covering a wide range of social desirability. It ap- 
pears likely that one can manipulate the mix of true responding 
and item endorsement by manipulating item desirability. The re- 
sults of Rorer and Goldberg (1965a, 1965b), obtained using the 
MMPI, are most likely to reflect item endorsement. Rorer and 
Goldberg reversed the typically positively worded and true-keyed 
originals to negatively worded and false-keyed form. An orig- 
inal like “I am possessed by evil spirits” might thus be reversed 
as “It is not true that I am possessed by evil spirits.” Their re- 
sults most likely reflect reliable tendencies to accept positively 
worded originals and reject negatively worded reversals, or vice 
versa. Clearly, in the light of the present analysis, and of the 
Jackson and Messick (1965) study, this sort of consistency 
would be expected on the basis of item endorsement. 

The results also explain the well established finding that not 
all measures of acquiescence correlate highly and positively 
(Rorer, 1965). It should be noted, however, that while true re- 
sponding and item endorsement are elicited to different degrees 
by different types of test material—attitude items, self-descriptive 
items, and adjectives—the second order factor loadings indicate 
some generality across test material for what is best called trait
endorsement, although this factor remains quite distinct from 
true responding. 

A final point worth noting is that none of the speculations re- 
garding methodology raised earlier by Block (1965) in an at- 
tempt to discredit earlier response style research apply to the 
present study. There was no item overlap; every scale consisted 
of items that were not shared with any other scale. The intrusion 
of content was avoided by careful selection of items from a large 
item pool and by handling MMPI items separately, by balancing 
acquiescence marker scales, and by accounting for content by
means of clearly identifiable content factors. Substantial response
style consistencies continue to be uncovered even when the most 
elaborate methodological safeguards are employed. 


OBJECTIVES AND ACHIEVEMENT MEASUREMENT:
THE CONGRUENCY BETWEEN STUDENTS’ AND
TEACHERS’ PERCEPTIONS OF BEHAVIORAL
OBJECTIVES


RICHARD H. COOP AND KINNARD P. WHITE
University of North Carolina at Chapel Hill 


It has become rather axiomatic for writers of textbooks on edu-
cational measurement to state that the validity of measures of 
educational achievement is predicated on clearly communicated 
educational objectives. Payne (1968), for example, has asserted 
that the statement of educational objectives in terms of expected 
behavioral changes probably constitutes the single most impor- 
tant element in the development of an achievement test. The 
argument by Cronbach (1970) that the validity of achieve- 
ment tests is primarily a question of content validity also implies 
that the principal question to be asked concerning the validity of 
an achievement test relates to what performance individual 
items demand of the student. Since the selection of test items for 
achievement tests depends to a large extent upon the statement 
of objectives, the clarity with which proposed objectives
communicate the intended behavioral outcomes is of fundamental
importance for constructing valid achievement tests. 

In an attempt to more clearly communicate educational objectives,
a number of educators and psychologists have suggested
that the broader goals of education be broken down into more
specific statements of the terminal behaviors expected of students.
Mager (1962) was one of the pioneers in the effort to state
educational objectives in terms of observable student behaviors.
Other writers (Walbesser, 1970; Popham, 1969) have also
emphasized the necessity for more effective communication between
teachers and students regarding the desired outcomes of education.

Criterion referenced instructional programs which utilize
behaviorally stated instructional objectives are currently receiving
national attention. These instructional programs are based on 
the assumption that teachers can clearly communicate their edu- 
cational goals to their students. Both the critics (Atkin, 1963; 
Eisner, 1967) and the advocates (Popham, 1969; Mager, 1962; 
Walbesser, 1970) of behavioral objectives have debated the prop- 
osition that teachers can, in fact, clearly communicate their 
instructional intent to students and consequently construct valid 
measures of achievement. Unfortunately, neither critics nor advo- 
cates argue from empirical data. Rather, their arguments are 
conducted primarily from an a priori basis. Mager (1962), for
example, presents no empirical evidence to support his contention
that certain action type verbs and verb phrases more clearly 
communicate educational objectives than do other seemingly less 
overt verbs. Those who use behaviorally stated instructional 
objectives must either accept the contentions of experts, use their 
own a priori assumptions, or empirically determine through trial 
and error which verbs clearly communicate the teachers’ intent. 

One of the few attempts to systematically investigate the per- 
ceived behaviorality of verbs frequently used in statements of 
instructional objectives (Dena and Jenkins, 1969) used only 
teachers—plus one administrator—to obtain ratings. While it is 
certainly interesting to observe the reliability among teachers 
regarding their judgment of the behaviorality of verbs, it appears 
that a much more fundamental question concerns the congruence 
of perceptions between teacher and student as to what achieve- 
ment behaviors are expected of the student. If students do not 
understand what behaviors they will be expected to perform,
then the error term in the measurement process is inflated.

The research reported here was designed to determine the
congruency between teachers’ expectancy of student understanding
and the students’ reported understanding of a number of verbs
and verb phrases frequently used in behaviorally stated
objectives.


Method 
Subjects 


Thirty-five teachers and 268 students in grades 9-12, and 13
teachers and 298 students in grades 4-7, served as raters. The
teachers and students in grades 9-12 were in the same school, as
were the teachers and students in grades 4-7; however, the 9-12
subjects and the 4-7 subjects were not in the same school system.
All of the teachers and the students had had at least one year 
of experience with instructional programs using behaviorally 
stated objectives. 


Instrument 


A list of 150 verbs and verb phrases was compiled from be- 
havioral objectives currently being used either in grades 9-12 or 
in grades 4-7 in the schools where the raters were located. The 
teachers were presented with the 150 verbs and verb phrases and 
were asked to rate each verb according to the following choices: 
(a) less than 25 per cent of my students would know what would 
be expected of them when asked to do this; (b) between 25 and 
49 per cent of my students would know what would be expected 
of them when asked to do this; (c) between 50 and 74 per cent of
my students would know what would be expected of them when
asked to do this; and (d) 75 per cent or more of my students
would know what would be expected of them when asked to do
this.

The students were presented with the same list of verbs and 
verb phrases and were asked to rate each verb according to the 
following choices: (a) I am sure that I would not know what to
do if a teacher asked me to do this; (b) I can’t decide if I would
know what to do or not if a teacher asked me to do this; and (c)
I am sure I would know what to do if a teacher asked me to do
this.


Procedure 


The 150 verbs and verb phrases were presented to the raters on
four single-column pages. Beside each verb or verb phrase, the
teachers were asked to circle a number from one to four indicat- 
ing the percentage of their students whom they felt would know 
what to do if presented this particular verb or verb phrase in a 
behavioral objective. Students were asked to circle a number from 
1-3 indicating their certainty as to whether they would know 
what to do if a teacher asked them to do this. The subjects rated 
the items during the same time period, which did not exceed 55
minutes. The rating was done in the spring. 


Results and Discussion 


Table 1 presents the verbs which 75 per cent or more of the 
students in grades 4-7 rated as being sure they would know 
what to do. This table presents these verbs in rank order ac- 
cording to the per cent of teachers in grades 4-7 who indicated 
that 75 per cent or more of their students would know what to do 
if presented with these verbs or verb phrases in an instructional 
objective. 
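
The 75 per cent criterion used to select verbs for Table 1 can be stated as a simple tabulation. The sketch below is illustrative only: it assumes the three-point student scale is coded 1 to 3 with 3 meaning "sure I would know what to do", and the ratings, the verb "to list", and the function name are hypothetical rather than taken from the study.

```python
def verbs_meeting_criterion(ratings, sure_code=3, threshold=0.75):
    """Return verbs whose share of 'sure' ratings reaches the threshold."""
    selected = []
    for verb, codes in ratings.items():
        share = sum(1 for code in codes if code == sure_code) / len(codes)
        if share >= threshold:
            selected.append(verb)
    return selected


# Hypothetical ratings from four students
# (1 = would not know, 2 = can't decide, 3 = sure).
example = {
    "to list": [3, 3, 3, 2],
    "to infer": [1, 2, 3, 2],
}
print(verbs_meeting_criterion(example))  # ['to list']
```

The same tabulation, applied to the teachers' four-point scale with the highest category as the criterion response, would give the per cent of teachers by which the verbs are ranked.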

The data presented in Table 1 may be viewed as an index of
congruency between elementary school teachers’ and students’
perceptions of the operationality of these verbs. It is particularly
revealing to note that the students indicated certainty on a much 
larger number of the verbs than was predicted by their teachers. 
Seventy-five per cent or more of the students in grades 4-7 indi- 
cated that they were sure they would know what to do with 108 
of the 150 verbs whereas only 41 verbs were considered by their 
teachers to be clearly operational to 75 per cent or more of the 
students. In no case did a teacher consider a verb operational
while a student did not. The data presented in this table also
imply that there was considerably more variance among teachers
across the list of verbs than among students. 

The results of the ratings of this same list of verbs and verb 
phrases by teachers and students in grades 9-12 may be observed in 
Table 2. 

The lack of congruency between teachers and students is even 
more apparent in the secondary school. Seventy-five per cent 
or more of the students in grades 9-12 indicated certainty of un- 
derstanding on 123 of the 150 verbs presented while their teach- 
ers considered only 23 of the verbs to be understandable by 75 
per cent or more of their students. These data also suggest con- 
siderably more variance among teachers than students across the 
list of verbs rated. Again, in no case did a teacher rate a verb as
clearly operational which a student did not. 
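
The size of the discrepancy at each level can be summarized from the counts reported in the text. The following sketch simply expresses those counts as shares of the 150-verb list; the variable and function names are illustrative, and no data beyond the four reported counts are assumed.

```python
TOTAL_VERBS = 150

# Counts reported in the text: verbs meeting the 75 per cent criterion.
reported = {
    "grades 4-7": {"students": 108, "teachers": 41},
    "grades 9-12": {"students": 123, "teachers": 23},
}


def summarize(counts, total=TOTAL_VERBS):
    """Return student share, teacher share, and the difference in verb counts."""
    student_share = counts["students"] / total
    teacher_share = counts["teachers"] / total
    gap = counts["students"] - counts["teachers"]
    return student_share, teacher_share, gap


for level, counts in reported.items():
    student_share, teacher_share, gap = summarize(counts)
    print(f"{level}: students {student_share:.0%}, "
          f"teachers {teacher_share:.0%}, gap of {gap} verbs")
```

Expressed this way, students in grades 4-7 were certain of 72 per cent of the list against about 27 per cent expected by their teachers, a difference of 67 verbs; in grades 9-12 the figures are 82 per cent against about 15 per cent, a difference of 100 verbs.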

Two important points emerge from a comparison of the data for 
the elementary school with the data from the secondary school. 
First, all the verbs but one (to memorize) that 75 per cent or
more of the students in grades 9-12 indicated were operational
and which 75 per cent or more of the teachers in grades 9-12 in- 
dicated were operational were included in the same category by 
teachers and students in grades 4-7. In addition, there were 17 
more verbs which teachers and students in grades 4-7 agreed 
upon as being operational. This would appear to mean that there is 
a greater degree of congruency between teachers and students in 
grades 4-7 than between teachers and students in grades 9-12.

Secondly, it appears that teachers both in grades 4-7 and 9-12 
tend to agree with the a priori judgments of the experts as to the 
operationality of verbs and verb phrases more than the students 
do. Perhaps this is because the teachers have read the experts 
and the students have not. Verbs which have been judged on an 
a priori basis to have low operationality such as to understand, 
to feel, to know, to learn, to think, and to be curious were per- 
ceived by 75 per cent or more of both elementary and secondary 
school students as being highly operational. 

In addition to determining which verbs were rated high on
operationality, those verbs which were frequently rated as
communicating complete uncertainty or indecision as perceived by
the student were ascertained. Table 3 presents those verbs which


TABLE 3

Verbs Which 60 Per Cent or More of the Students in Grades 4-7 Indicated That They
Were Either Uncertain or Would Not Know What to Do, Ranked by Per
Cent of Teachers, Grades 4-7, Who Indicated 50 Per Cent or More
of Their Students Would Not Know What to Do

VERB                          % TEACHERS    VERB                          % TEACHERS
Sensitive To                  92            Think Critically              85
Critique                      92            Utilize                       85
Deduce                        92            Categorize                    77
Give Debit and Credit         92            Contrast                      77
Partition                     92            Demonstrate a Knowledge of    77
Perceive                      92            Detect                        77
Acknowledge                   85            Distinguish                   77
Analyze                       85            Role Play                     77
Appraise                      85            Scan                          77
Formulate a Generalization    85            Compute                       69
Generate                      85            Convert                       69
Infer                         85            Evaluate                      69
Secure                        85            Diagram                       54
                                            Dramatize                     54

Student N = 298.
Teacher N = 13.
50 per cent or more of the students in grades 4-7 rated as
communicating complete or partial uncertainty. The verbs are
ordered by the per cent of teachers who indicated that 50 per
cent or more of their students would not know what to do.

Table 4 presents the same data for students and teachers in
grades 9-12.

The 50 per cent criterion used in these tables appears to be a
rather conservative criterion judged by the standards set by
current writers on objectives. Frequently, a criterion of 80-90 per
cent of the students being able to perform to criterion is suggested
(Bloom, 1968; Popham, 1969). Certainly if 50 per cent or more
of the students cannot clearly understand what behavior is
intended by the objective, then such a performance criterion could
not be expected.

The data presented in Tables 3 and 4 indicate that only 8
of the 150 verbs were perceived by students in grades 9-12 with
uncertainty. Among students in grades 4-7, 27 verbs were
perceived with uncertainty. It is most probable that the vocabulary
level of the younger students can account for the fact that over
three times as many verbs were viewed with uncertainty by
students in grades 4-7 as by students in grades 9-12. All verbs
considered to be unclear by students in grades 9-12 were included
in the list of unclear verbs as rated by students in grades 4-7.
Also, for every verb rated by students in either grades 4-7 or
grades 9-12 as unclear, teachers in the respective grade levels
predicted that the verb would be viewed by their students as
uncertain.


TABLE 4

Verbs Which 60 Per Cent or More of the Students in Grades 9-12 Indicated
They Were Either Uncertain or Would Not Know What to Do, Ranked by
Per Cent of Teachers, Grades 9-12, Who Indicated 50 Per Cent or
More of Their Students Would Not Know What to Do

VERB                          % TEACHERS
Critique
Perceive                      89
Formulate a Generalization    86
Infer                         86
Appraise                      83
Generate                      80
Partition
Give Debit and Credit

Student N = 268.
Teacher N = 35.


Summary 


Although the specification of educational objectives in behavioral
terms has received considerable attention, virtually no
attention has been given to the proposition that the objectives
themselves are data. Objectives, like all other educational data,
would seem to be subject to error. In the construction of measures
of educational achievement, errors due to faulty objectives would 
seem to be a primary contribution to invalidity. If the unam- 
biguous communication of educational objectives is to result in 
the improvement of student performance as claimed by Mager 
and McCann (1962) and Walbesser (1970) and in the improve- 
ment of the construction of measures of educational achieve- 
ment as claimed by Payne (1968) and Popham and Husek (1969) 
then the objectives used must be known to communicate the 
teacher’s intent to the student. The congruency ranking of the 
150 objectives presented in this research seems to indicate that 
both teachers and students can agree on the intended behaviors 
demanded by a substantial number of instructional objectives. 
However, the phenomenon of students claiming to understand 
the intended operationality of the large number of verbs which 
teachers believe they could not understand demands further
empirical study. Perhaps an even more crucial question for further
research is whether or not those students who claim they are
sure they would know what to do when they are presented with 
certain behavioral statements actually perceive the same inten- 
tional meaning of the statement as do their teachers. 


THE RELATIONSHIP OF PARENTAL PERCEPTIONS
OF UNIVERSITY LIFE TO THEIR CHARACTERIZATIONS
OF THEIR COLLEGE SONS AND DAUGHTERS


ROBERT D. BROWN 
University of Nebraska 


Just as it is important to understand student perceptions of
the campus environment because of interactions between these
perceptions and behaviors, it is also important to determine what
influences others’ perceptions of the same environment. Parents
would appear to be a critical, as well as a long neglected,
population. How much do parents really know about the campus climate
and the pressures the campus environment places upon their
college sons and daughters? Is there a relationship between parental
perceptions of university life and how they characterize their own
sons and daughters? Having a child attending a college probably
operationally means an occasional visit to the campus and being
slightly more attentive to newspaper stories about the college.
However, reports parents receive about campus life from their
sons or daughters could well be selective and biased. The parental
image of the campus may be no more accurate than that of
the general public and may be shaped by their image of their own
college son or daughter, as much as it is by what they actually
know about the environment.

The purpose of this investigation was threefold: (1) to assess
parental perceptions of a university environment and compare
these perceptions with those of students, (2) to compare the
perceptions of parents of entering freshmen with those of parents of
upperclassmen, and (3) to investigate whether or not parental
perceptions of the university were independent of how they
characterized their own college sons or daughters.

Most of the research related to these questions has focused on
student populations. For example, degree of familiarity with the 
college has been found to have an impact upon how students 
perceive the campus. When the viewpoints of freshmen and
upperclassmen were compared, freshmen were found to have
idealistic and unrealistic views of the campus (Berdie, 1966;
Feldman and Newcomb, 1969; Johnson and Kurpius, 1967). Pace
(1966) reports that the patterns of environmental perceptions for
different groups of students are essentially the same, but on the
basis of several studies he strongly recommends using third 
semester students as reporters. The same pattern might be pre- 
dicted for parents. Parents of entering freshmen might be ex- 
pected to reflect views of the campus environment similar to those 
of the general public (Evans, 1970), whereas parents of upper- 
classmen should be better informed and more accurate. 

The possibility that parental perceptions of the campus are re- 
lated to characterizations of their own sons or daughters rests on 
several premises. There is the possibility that most parental per-
ceptions of the campus are based on rather limited input, perhaps
much of it what they hear from their own children. If there is
a perception among parents that the environmental press is strong
in the scholarship arena, for example, this might be partly be-
cause of verbal reports from their own college student. It might
also be related to how they view their own son or daughter. If
their son or daughter is a good student and scholarly, it is possible
that the parents would tend to see the campus environment as a
setting which reinforced these interests. In an effort to maintain
congruency between the two perceptions, of their child and of the 
college, parents might be projecting characteristics of their child 
on to their view of the campus (Bruner, 1957). 

Among students there is some evidence that there is an inter-
action between student personality characteristics and their per-
ceptions of the campus climate. Yonge (1968) and Marks (1968)
both set out to test whether student characteristics and their 
perceptions of the environment were related, or whether they were 
two separate and distinct domains. Both found personality and 
motivation to be related to environmental perceptions. Whereas
Berdie (1967) ascribed differences between campus groups to
the actual many-faceted aspects of the university community,
Yonge and Marks suggested that in reality there were as many
versions of the campus, which were functionally different, as
there were individual students.

This evidence at least raises the possibility that students could 
be biased reporters of the campus scene to their parents and con- 
sequently parents are likely to see the campus through the eyes 
of their son or daughter. 


Instruments 


A revised version of the College and University Environment 
Scales, Second Edition (CUES) and The Adjective Check List 
(ACL) were chosen as the instruments for this study. The CUES
has been used in a growing number of campus environment stud-
ies and its five scales (Practicality, Community, Scholarship,
Awareness, and Propriety) provide data based on factorially de-
rived dimensions, which collectively provide a comprehensive pro-
file of the campus climate. The ACL is a brief, nonthreaten-
ing personality inventory, which yields a profile based on Murray’s
15 need-press personality dimensions (Murray, 1938). Its format 
makes it particularly suitable and adaptable for third person 
descriptions. 

The CUES was revised so that items with references to “here” 
or “at this campus” were reworded to read “at the University” or 
“at the University of Nebraska.” This resulted in changes in 10 
of the 100 items of the CUES. Provision was also made for re-
spondents to indicate their degree of certainty on a 4-point scale
(0 = a guess, 1 = some idea, 2 = pretty sure, 3 = very sure).

With the ACL were special instructions asking respondents to 
check the adjectives which best described their son or daughter 
who was entering college or now attending. Standard scores,
pro-rated according to the total number of adjectives checked,
are provided for each scale (Gough and Heilbrun, 1965).


Sample and Procedure 


The CUES and ACL were mailed to two random samples of
parents (100 parents of freshmen and 100 parents of sophomore
and junior upperclassmen) of University of Nebraska students.
Useable returns were obtained from 160 parents, 85 parents of 
freshmen and 75 parents of upperclassmen. Demographic data
were also collected on the parents and comparisons between par-
ents of freshmen and upperclassmen yielded no significant differ-
ences on size of home town, educational background of the father,
or distance from the University. Only those with one child in col- 
lege were included in the sample. 

Student CUES profiles collected over a three year period were 
available on 350 students, who represented various living unit 
complexes. A random sample of 25 profiles from this population
was combined with a random sample of current student profiles
to construct a student picture of the campus environment.


Scoring and Analysis 

The CUES scoring procedure outlined by Pace (1969) was in- 
tended to obtain a consensus description of the environment 
rather than an individual score. A consensus score is obtained by 
adding the number of items answered by 66 per cent or more of 
the respondents in the keyed direction, subtracting the number 
of items answered by 33 per cent or fewer in the keyed direction, 
and adding 20 points to the difference. This procedure was em- 
ployed for purposes of comparing student and parent profiles. 
The responses of individuals were also scored in the traditional 
psychometric fashion for the purpose of obtaining individual 
scores. Chi-square analyses were made to determine whether 
or not there were significant differences in the responses of parent 
and student groups to individual items of the CUES. 
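
The consensus scoring rule described above can be sketched in a few lines. This is only an illustration of the arithmetic as stated in the text, not Pace's published scoring program; the function name, its parameters, and the example proportions are hypothetical.

```python
def consensus_score(keyed_proportions, upper=0.66, lower=0.33, constant=20):
    """Return a consensus score for one scale.

    keyed_proportions holds, for each item on the scale, the proportion of
    respondents who answered in the keyed direction.
    """
    # Items endorsed in the keyed direction by 66 per cent or more.
    agreed = sum(1 for p in keyed_proportions if p >= upper)
    # Items endorsed in the keyed direction by 33 per cent or fewer.
    rejected = sum(1 for p in keyed_proportions if p <= lower)
    # Difference plus the constant of 20 points.
    return agreed - rejected + constant


# Hypothetical proportions for a seven-item illustration (not study data).
example = [0.80, 0.70, 0.66, 0.50, 0.40, 0.33, 0.10]
print(consensus_score(example))
```

Because the score depends only on how many items cross the two thresholds, it describes the group as a whole and cannot be computed for a single respondent, which is why individual scores were obtained separately.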

The individual parental portrayals of the campus environment 
on the five CUES scales were correlated with the standard scores 
of the ACL descriptions of their sons or daughters. Separate
correlations were computed for parents of freshmen and parents
of upperclassmen.
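
A minimal sketch of this kind of correlational analysis follows. It assumes a product-moment correlation between one CUES scale score and one ACL standard score per parent, with the usual t statistic for testing a correlation against zero; the source does not name the exact coefficient or test used, and the data shown are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical scores for five parents (not study data).
cues_community = [12, 15, 9, 14, 11]
acl_achievement = [55, 60, 48, 58, 50]
r = pearson_r(cues_community, acl_achievement)
print(round(r, 3), round(t_statistic(r, len(cues_community)), 3))
```

Running the same computation separately within each parental group, as the text describes, keeps a relationship that holds in only one group from being masked or manufactured by pooling.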


Results 
Comparison of Parents and Students 


Table 1 presents the CUES scores for parents and students 
using Pace's consensus scoring procedure. In all instances, whether 
it was parents of freshmen or parents of upperclassmen, the par-
ents’ portrayal of the campus environment ranked substantially
higher than the students’ portrayal, when these scores were com-
pared to the reference group of 100 colleges. The pattern was
consistent for all five environmental scales with the greatest dis- 
crepancies appearing on the Scholarship and Community scales. 

Analyses of responses to individual items on the CUES re-
vealed that there were 27 items on which at least two-thirds of
both the parents and the students agreed. Most of the items for
which there was both parental and student consensus centered on
aspects of the environment related to campus rules and regulations,
and on whether or not “good fun and school spirit” pervaded the 
campus scene. These items were chiefly from the Practicality and 
Propriety scales. 

There were 45 CUES items on which the majority of students 
held different opinions than the majority of parents, and chi- 
square analyses resulted in significant differences for 27 of these 
items. CUES items on which there were significant differences 
of opinion focused on topics related to the Scholarship and Aware- 
ness scales. Parents tended to see the environment reflecting a 
much greater emphasis on scholarship, intellectual activities and 
cultural events than did the students. Students saw the campus 
as less academic and more restrictive than did parents. 


Comparison of Parents of Freshmen and Upperclassmen 


Table 2 presents the CUES results on the five scales after they 
were scored in the traditional psychometric fashion, and are re- 
ported in terms of means and standard deviations along with the 
certainty scores. There was a significant difference between the 
two parental groups on the Practicality scale with parents of 


TABLE 1
CUES Consensus Scores and Percentile Ranks for Parents and Students

               Freshmen    Upperclassmen     Total
               Parents        Parents       Parents      Students
               N = 85         N = 75        N = 160       N = 50

Scale         Score  PR     Score  PR     Score  PR     Score  PR
Practicality  [row illegible in source]
Scholarship    37    98      36    93      38    98      21     4
Community      31    78      28    68      29    74      20    26
Awareness      34    94      34    94      35    [?]     21    62
Propriety     [row illegible in source]

Note. Scoring procedure and norms are provided in Pace (1969).


TABLE 4
Correlations between CUES and Adjective Check List for Parents of Upperclassmen

                    Practi-   Scholar-  Commu-    Aware-    Propri-
Scale               cality    ship      nity      ness      ety       M

Need Scales
Achievement          .025      .090      .959**    .149      .261*    56.58
Dominance            .053      .124      .366**    .146      .158     54.78
Endurance           -.075      .066      .291*     .256*     .322**   56.86
Order                .049      .052      .424**    .214      .376**   54.63
Intraception        -.148     -.097      .046      .191      .237*    51.45
Nurturance          -.123     -.021      .057      .195      .240**   52.86
Affiliation         -.070      .037      .114      .21       .374**   50.41
Heterosexuality      .099      .119      .106      .153      .093     51.48
Exhibition           .243*     .171      .012      .012     -.100     50.06
Autonomy             .056      .038     -.054     -.160     -.427**   47.50
Aggression           .133      .069      .003     -.160     -.809*    49.14
Change               .153      .148     -.031     -.145     -.095     44.45
Succorance           .067      .049     -.194     -.184      .039     45.38
Abasement           -.075     -.104     -.161     -.016      .181     47.72
Deference           -.182     -.102     -.039      .117      .367**   51.38

Supplementary Scales
Defensiveness       -.055     -.028      .243*     .146      .363**   54.47
Self Confidence      .174      .163      .305*    -.101      .027     52.88
Self Control        -.157     -.063      .131      .160      .303*    56.08
Lability            -.042      .162     -.074      .043     -.048     47.38
Personal Adjustment -.125     -.090      .132      .227      .259*    51.58

* Significant at .05 level.
** Significant at .01 level.


campus atmosphere and descriptions of their sons or daughters
were not independent for parents of upperclassmen.


Discussion 
The results of the comparisons of CUES profiles between par-
ents and students, and between parents of freshmen and upper-
classmen suggest that this instrument can be a useful device for
determining parental perceptions of the campus environment.

Consensus among the parents was greater than it was for the
students. The degree of certainty varied depending upon whether
or not they were parents of entering freshmen or of returning
students, but the overall level of certainty indicated that the
respondents felt they were doing more than just guessing. Though
there were significant differences between the two parental groups
in their degree of certainty, their perceptions of the environment
were quite similar.

The tremendous discrepancies between student and parent per-
ceptions of the environment, regardless of the experience (new or
returning) their son or daughter had with the University, sug- 
gest that even though parents may have a “reporter on the scene,” 
their perceptions are still not congruent with those of students. 
Parents of upperclassmen remained idealistic, seeing the campus 
as an intellectual beehive and the college administration as highly 
benevolent. This raises questions about how much students talk 
with their parents about campus life, aside from their own 
goals and academic achievements. How much do they discuss 
what happens day-by-day in their classes, the kind of examina- 
tions they have, or what extra-classroom activities are like? If 
discussions take place between parents and students about cam- 
pus life, the finding of relatively few differences in this study be- 
tween parents of freshmen and upperclassmen suggests that this 
communication has little impact upon how parents perceive the 
campus environment. 

The cluster of relationships between the perceptions of parents
of upperclassmen about the Community and Propriety dimen-
sions of the campus and their ACL characterizations of their sons
and daughters centers on personal characteristics related to
task-orientations and interpersonal relationships. Parents who saw
the campus as high in Propriety, which suggests a campus atmo-
sphere that is polite, proper, conventional, and cautious, and where
group standards are important, also saw their son or daugh-
ter as moderate (low Autonomy), conformist (low Aggression),
self-denying (high Deference and Defensiveness), as well as hard- 
working (high Achievement), patient (high Endurance and Self- 
Control), and organized (high Order). These characteristics are 
quite compatible with an environment portrayed as being high 
on the Propriety CUES scale. 

The relationships between the CUES Community scale and the
ACL son or daughter descriptions for the parents of upperclass-
men are less easily seen as compatible, unless the campus com-
munity is seen as striving, and achievement- and goal-oriented.
Parents who saw their children as hard-working (high Achieve-
ment), forceful and outgoing (high Dominance), responsible (high
Endurance), organized (high Order) and confident (high Self-
Confidence), also tended to see the campus as friendly, cohesive
and group oriented. This pattern is not contradictory, as there
appears to be more of a strain of optimism, idealism and trust as-
sociated with this ACL profile than aggressive competitiveness.

The finding that perceptions of parents of freshmen were on the 
whole independent of their characterizations of their children, 
whereas those of parents of upperclassmen were not, suggests pos- 
sibilities that must remain speculations at this time. Marks (1968) 
found that uncertainty about the environment in a student pop- 
ulation was more likely to lead to portraying the campus as con- 
gruent with certain personal characteristics than was certainty. 
But in this study the relationships between parental portrayals 
of their sons or daughters and their perceptions of the campus 
were significant only for the group of parents who were more 
certain of their responses about the environment. If cognitive 
dissonance (Festinger, 1957) between parental self-image and 
their characterizations of the campus were a factor, one would 
expect this to be equally true for parents of freshmen and parents 
of upperclassmen. 

For parents of upperclassmen the need to reduce dissonance 
might well be operating in a different fashion, as they attempt to 
put together what they hear about the campus from their son or
daughter, what kind of person they picture them to be, and their
own image of the campus environment. It is possible that the
students who had been on campus for several years were selec-
tive reporters, discussing aspects of campus life that concerned
them the most, and which also reflected their own interests and
characteristics. It is also possible that parents were selective
listeners as well.

Further research might well follow the pattern of that done 
with student populations. Initial efforts could assess perceptions 
among different sub-populations, such as urban and rural area 
parents, college and non-college educated parents; and later ef- 
forts could explore possible individual characteristics related to 
environmental perceptions. The next step would be to give more
attention to ways of both changing the environment and making
perceptions more realistic.


ATTITUDE CHANGE AS A FUNCTION OF
COMMITMENT, DECISIONING, AND INFORMATION
LEVEL OF PRETEST¹


T. A. NOSANCHUK, LEON MANN AND IRENE PLETKA²
Harvard University


A widely recognized problem associated with the standard pre-
test-treatment-posttest design in attitude change research is that
the process of measurement per se may change the attitude un-
der investigation. So-called reactive effects occur whenever the
testing process is in itself a stimulus to change or to maintain a
habit, rather than a passive record of behaviour (cf. Campbell
and Stanley, 1966, p. 8). Thus, in an experiment on the effect of
a persuasive communication on attitude change the initial pre-
test might act as a stimulus to modify opinions or to resist influ-
ence even without the persuasive message.

The present study is concerned with three features of the usual
attitude pretest (communication of information, commitment, and
decisioning) and their effect on opinion change.


Informational Inputs 


In attitude change research the typical pretest may contain
informational inputs which interact with information contained
in the persuasive communication to boost the magnitude of change,
even when the items are couched as mere affirmative statements.
The effect of a set of pretest items might be “to inform (the in-
dividual) that such a position exists, to induce him to consider
it, to notice and examine pertinent information, and possibly
to change his attitude” (Nosanchuk, 1970, p. 13). In brief, we
assume that information embedded in the pretest has an advertis- 
ing or priming effect which may facilitate opinion change. 


Commitment 


A seemingly incidental, but none the less important conse-
quence of taking a pretest is the psychological effect of signing
one's name to the questionnaire. Lana (1966) has postulated
that a basic dimension of pretest-taking is the degree of external
commitment to the initial position recorded prior to the persuasive
communication. When the subject signs his name to a pretest in
such a way that it becomes part of the experimenter's records,
he has engaged in a form of public commitment which tends to 
anchor him to his initial pretest position. Accordingly, it is as- 
sumed that the act of responding publicly to the pretest by sign- 
ing one's name to it, will function to inhibit subsequent opinion 
change. 

Decisioning 

A third, related aspect of pretest behaviour is the process of
arriving at a decision in the form of selecting and checking between
response alternatives to pretest items. In line with Lewin’s (1958)
and Pelz’s (1958) emphases on the “freezing” effects of arriving
at a decision, it can be argued that the process of making a de-
cision required in giving answers to a set of attitude items may
have an immunization effect against subsequent influence at-
tempts (cf. McGuire, 1964). We assume, then, that the process
of decisioning involved in responding to a pretest will function 
to inhibit subsequent opinion change. 

In sum, the present study aims to test three hypotheses related 
to the effects of attitudinal pretests on subsequent opinion change. 

1. That the greater the number of informational inputs con- 
tained in the pretest, the greater the magnitude of attitude change 
mediated by the persuasive communication. 

2. That the act of responding publicly to the pretest by signing
one’s name to it will function to inhibit opinion change mediated
by the persuasive communication.

3. That the process of selecting between and checking pretest 
responses functions to inhibit opinion change mediated by a sub- 
sequent persuasive communication. 

The experimental procedures for testing these hypotheses are 
described in the next section. 


Method 
Overview 


An attitude change study employing the standard pretest-post- 
test design was conducted with a sample of college students. The 
issue of attitudes toward vivisection was used as a vehicle for the 
study. Three levels of pretest information and two levels each of 
commitment and decisioning were manipulated in a multi-treat- 
ment group design to test for the effects of these three independent 
variables alone and in interaction. 

The pretest format was varied systematically across conditions
so that subjects responded to items containing high, medium or
low levels of information on the attitude issue (information vari-
able), signed or did not sign their names on the pretest question-
naire (commitment variable), and answered or did not respond
to the pretest items (decisioning variable). Two control groups,
in which subjects were not exposed to any form of pretest, were
included in the design. The pretest was followed by a persuasive
pro-vivisection communication. This in turn was followed by an
attitude posttest on the issue, to assess the effects of the three
independent variables on magnitude of opinion change.
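
The crossing of pretest variables can be laid out explicitly. The sketch below assumes the three commitment-decision combinations that appear as rows in the design (commitment with decision, no commitment with decision, and no commitment with no decision), each crossed with the three information levels; the variable and function names are illustrative only.

```python
from itertools import product

INFORMATION_LEVELS = ("high", "medium", "low")

# Only three of the four possible commitment-by-decision pairings are used;
# signing one's name without answering the pretest is not part of the design.
PRETEST_FORMATS = (
    ("commitment", "decision"),
    ("no commitment", "decision"),
    ("no commitment", "no decision"),
)


def experimental_conditions():
    """Cross information level with the three commitment-decision formats."""
    return [
        (info, commitment, decision)
        for info, (commitment, decision) in product(INFORMATION_LEVELS, PRETEST_FORMATS)
    ]


for condition in experimental_conditions():
    print(condition)
```

This yields the nine experimental conditions; the two control groups, which received no pretest, sit outside the crossing.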


Subjects 


The subjects were 136 students in three introductory psychol-
ogy classes at Framingham State College, and one introductory
class at Northeastern University. The subjects were unpaid, and
participated in the study during their regular class meeting.


Procedure 


Within each class, students were randomly allocated to one of
the nine experimental conditions and two control conditions (see
Table 1). Each subject received a six-part booklet which varied
in format according to experimental condition. A brief verbal 
introduction was given, instructing subjects to work independently 
and not to turn back to completed parts of the booklet. 

Commitment manipulation. The first part of the booklet, the 
cover sheet, contained brief instructions for completing the 
booklet, together with a space in the upper left corner for subjects 
to enter their name and sex (commitment condition only). In the 
noncommitment condition the space for name was conspicuously 
blocked out. The next part consisted of a neutral 200-word mes-
sage on vivisection (Lana, 1959), the purpose of which was to
familiarize subjects with the issue of vivisection.


Information manipulation. The next section of the booklet con-
tained a five-item attitude pretest which varied in the amount
of detail and facts it conveyed in order to establish the informa-
tion manipulation. The low level consisted of five simple declara-
tive statements. The medium level was made up of the same stem
statements, to which several informative clauses were added. The
high information items, constructed in the same way, consisted of
the medium information items with additional informative clauses.
Here is an example of how basic item statements were embel-
lished with additional facts in order to boost the level of infor-
mation contained in the pretest (Item 5).


Low Information

We should encourage vivisection research in the future.

Medium Information

Because medical research on animals is much less expensive
than other kinds, we should encourage vivisection research in
the future.

High Information

Much more basic research can be done because medical re-
search on animals is much less expensive than other kinds, so we
should encourage vivisection research in the future.

All items were slanted in a pro-vivisection direction, and all
offered six response alternatives, from “agree strongly” to “dis-
agree strongly.”

Decision manipulation. At each of the three information levels,
subjects in the “No Decision” condition were instructed to read 
but not answer the pretest questions. Subjects in the decision 
condition responded to the pretest by checking response altern- 
atives. 

Subjects in the control group were given no pretest at all (see 
Table 1). 

Persuasive communication. Part Four consisted of a 400-word 
pro-vivisection communication taken from Lana (1959). 

Attitude posttest. Part Five was the posttest, consisting of 
five items on the issue of vivisection derived from the Lana- 
Molnar Scale (Lana, 1959). The items were similar to the pre- 
test items, but consisted of belief statements without supporting 
facts or information. Here is an example (Item 4).

Animal experimentation is justified if the animals do not suffer.

The response alternatives were the same as on the pretest.

Post-experimental debriefing. The final page of the booklet con- 
tained two questions soliciting the subject’s views of the pur- 
pose of the experiment and any criticisms he may have had. 

After all the subjects had completed their booklets, they were 
informed of the nature of the experiment and all their questions 
were answered. Several weeks after completion of the experiment 
subjects were sent a brief report on findings from the study. 


TABLE 1
Experimental Design for Testing Effects of Information Level, Commitment and
Decisioning

                               Level of Information Contained in the Pretest

                                                                   None
Condition                      High      Medium    Low            (Control)

Commitment Decision
(S affixes name and
answers pretest)               N = 16    N = 16    N = 16         N = 8

No Commitment Decision
(S does not affix name
but answers pretest)           N = 16    N = 16    N = 16         N = 8

No Commitment
No Decision
(S neither affixes name
nor answers pretest)           N = 8     N = 8     N = 8

N = 136.




Results and Discussion 


Mean attitude scores for groups in each of the experimental
and control conditions are presented in Table 2. t tests for com-
paring control groups with each of the treatment conditions
were conducted separately from the main analysis. Comparisons
between control groups were also made separately (see Table 2).

For both the commitment-decision (Row 1) and the no-com-
mitment-decision (Row 2) conditions the combined means of the
three information levels were not significantly different from
those of the control group, an unexpected finding given the fre-
quency with which pretest effects have been found (cf. Lana,
1969).

The principal analysis, made on the posttest scores, was a
Model 2, 3 × 3 factorial analysis of variance with proportional
cell sizes. The results are presented in Table 3. None of the main
effects or interactions was significant. The row sum
of squares was partitioned into two orthogonal components: the
comparison between commitment-decision (22.40) and no-com-
mitment-decision (22.77) was not significant. However, the com-
parison between decision with and without commitment (22.59)
and no-commitment-no-decision (20.46) was statistically signifi-
cant. It would appear then that the only significant effect is that
of the subject’s making a decision. This would appear to support
Pelz's (1958) finding of a decision effect of some sort, but here
making a decision seems to have facilitated change rather than
inhibited it.


TABLE 2 
Mean Attitude Scores on Posttest by Information, Commitment and Decision
Condition

Information Level 
A Row 
High Medium Low Means Control 
Commitment Decision 22.9 
No Commitment ЖА шз. S din Рај 
Decision 22.9 

NL MU 23.0 23.4 23.1 23.4 
No Decision OS es 1 / 
Column Means 24 nr 18 7 ~

TABLE 3
RAW SCORES: ANOVA
Summary Table

SOURCE                  SS      df     MS       F
ROWS                    109.5     2    54.75    n.s.
  Row 1 vs. 2            12.3     1    12.3     n.s.
  Row 1 + 2 vs. 3        97.2     1    97.2     4.4 (p < .05)
COLUMNS                  17.4     2     8.7     n.s.
INTERACTION              31.1     4     7.8     n.s.
ERROR                  2457     111    22.1
TOTALS                 2615     119


Row 1: Commitment-Decision.
Row 2: No Commitment-Decision.
Row 3: No Commitment-No Decision.
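
As a rough arithmetic check on the summary table, each mean square is the sum of squares divided by its degrees of freedom, and each F ratio is the effect mean square divided by the error mean square. The sketch below takes the SS and df entries as printed in Table 3; the function and variable names are illustrative and not part of the original analysis.

```python
# Sums of squares and degrees of freedom as printed in Table 3.
TABLE_3 = {
    "Rows": (109.5, 2),
    "Row 1 vs. 2": (12.3, 1),
    "Row 1 + 2 vs. 3": (97.2, 1),
    "Columns": (17.4, 2),
    "Interaction": (31.1, 4),
}
ERROR_SS, ERROR_DF = 2457.0, 111


def mean_square(ss, df):
    """Mean square: sum of squares over its degrees of freedom."""
    return ss / df


def f_ratio(ss, df, error_ss=ERROR_SS, error_df=ERROR_DF):
    """F ratio: effect mean square over error mean square."""
    return mean_square(ss, df) / mean_square(error_ss, error_df)


for source, (ss, df) in TABLE_3.items():
    print(f"{source:<16} MS = {mean_square(ss, df):6.2f}  F = {f_ratio(ss, df):5.2f}")
```

With these entries the Row 1 + 2 vs. 3 contrast gives an F of about 4.4, while the ratios for Row 1 vs. 2, columns, and interaction all fall below 1. The degrees of freedom are also internally consistent: a total of 119 implies 120 subjects in the nine experimental cells, leaving 120 - 9 = 111 for error.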


Level of Pretest Information 


The present study is not the first to investigate the facilitating
effects of an attitude pretest (cf. Pauling and Lana, 1969), but,
to our knowledge, it is the first in which the level of informational
inputs was varied in order to assess their influence on receptivity
to a subsequent persuasive communication. The amount of infor-
mation contained in the pretest was systematically varied by
adding facts and supportive evidence to the stem item state- 
ments. We hypothesized that the number of informational inputs 
contained in the pretest would be positively related to the mag- 
nitude of attitude change. It was found, however, that the level 
of pretest information had no significant effect on acceptance of 
the persuasive communication, either as a main effect (F < 1) 
or in interaction with the commitment-decision variable (F < 
1). It is conceivable that the vivisection issue is one on which
most subjects had a great deal of knowledge. Therefore, al-
though extra information was presented to some subjects, it may
have been information with which they were already familiar. 


Commitment Effects


While we hypothesized that information contained in the pretest
would serve to facilitate subsequent opinion change, we also hypoth-
esized that the act of committing oneself to an attitudinal position
by responding publicly to the pretest would function to inhibit
opinion change. Operationally, commitment was defined, for the
purposes of this study, as affixing one’s name to the questionnaire
which included the pretest. Comparing subjects who responded to
the pretest anonymously (X = 23.1) with those who were required
to put their name on the pretest (X = 22.40) it is seen that the
commitment manipulation had little or no substantial effect (F < 1).
This is inconsistent with the finding of Pauling and Lana (1969), that
a minimum commitment associated with responding to a pretest
anonymously was maximally effective in allowing for attitude change.
A major difference between the Pauling and Lana study and the 
present one is that, while Pauling and Lana left an interval of three 
to seven days between administration of the pretest and presentation 
of the persuasive communication, our subjects received the communi- 
cation immediately after the pretest. It is possible that the length 
of the pretest-communication interval is a major determinant of 
whether or not the pretest will have a committing effect. Living 
with the knowledge of a commitment for a period of several days 
may be optimal for anchoring a subject to his initial position, whereas 
a brief delay may not produce this effect. However, in another 
experiment Lana and Rosnow (1968) found no differences between 
immediate and delayed posttest administration in influencing attitude 
change toward vivisection. 


Effects of Decisioning 


Operationally, decisioning was defined as answering the pretest 
by checking responses to the pretest items. Subjects in the non- 
decision condition were instructed to read the items but to refrain 
from checking their opinion responses. Decisioning is conceptually 
distinct from commitment, in that the former presupposes cognitive 
work leading up to a selection between competing alternatives, while 
the latter is an act of publicly acknowledging the stand one has
taken in choosing an alternative. Presumably making a private
decision, because of its “freezing” effect, serves to inhibit subsequent
attitude change (cf. Lewin, 1958). Pelz (1958) found a significant
tendency for subjects who were asked to come to a private (or at
least anonymous) decision to persist longer in their intention than
those not so requested. But as McGuire (1964) points out, private
decisions are the most tenuous kinds of “commitment”, and there
have been numerous studies in which pretests have failed to “freeze”
or “immunize” the individual against subsequent persuasive attempts
(cf. Lana, 1959; Hicks and Spaner, 1962).

Contrary to the experimental prediction, the decision manipulation,
rather than inhibiting acceptance of the communication, served to
facilitate attitude change. Regardless of level of commitment, sub-
jects who made a decision by checking response alternatives on the
pretest showed more acceptance of the pro-vivisection message
(X = 22.75) than subjects who read but did not check the pretest
items (X = 20.46), (F = 4.4).

Why did the decision act to facilitate rather than inhibit subse- 
quent attitude change? A number of equally plausible explanations 
suggest themselves. Perhaps the process of decisioning has a sensi- 
tizing effect which motivates the subject to attend more carefully 
to a persuasive communication which advocates congruent change. 
This explanation presupposes that the activity of answering the 
pretest renders the issue more salient and important to the respondent. 
When the communication advocates a more extreme pro-norm 
position the subject will more readily learn and accept the persuasive 
inputs. Another explanation refers to attitudes toward the experi- 
mental situation as a whole, and the subject’s motivation to play 
the “good” subject. The juxtaposition of pretest, communication 
and posttest may have alerted subjects to the experimental “demand” 
of attitude change. In the decision condition, where the pretest 
aspect of the design was especially prominent, subjects may have 
been particularly motivated to show acceptance of the communica- 
tion in order to “help the hypothesis.” Unfortunately, it is impossible 
to settle the issue with the available data. The present findings 
indicate, however, that the effect of a pretest decision on receptivity 
to subsequent persuasive inputs is a complex problem which deserves 
further, careful investigation. 


SUBJECT REACTIVITY TO MULTIPLE EVALUATIONS 
IN A FIELD SETTING: IMPLICATIONS FOR 
VALIDITY OF EVALUATION STUDIES 


ROSS J. LOOMIS, JAMES W. KELLEY AND J. STANLEY AHMANN 
Colorado State University 


In recent years government sponsored projects in the educa- 
tional field have been subjected to some kind of planned evalua- 
tions. Frequently, more than one evaluation is made of an on-going 
project. A question that can be asked is concerned with the effect 
of such multiple evaluations or monitorings of educational proj- 
ects on the performance of the project, morale of personnel, or 
frustration level of teachers and students (Webb, Campbell, 
Schwartz, and Sechrest, 1966). This is a particularly salient 
question in light of the evidence of the reactivity of persons to 
the measurement process. For example, the work of Webb et al. 
(1966) has stressed the role of unobtrusive measurements and 
problems connected with reactivity to the measurement process. 
The present paper explores the feasibility of the measurement of 
possible components of reactive behavior that can contribute to 
the generation of sources of invalidity in evaluation studies. 

Method. A number of project directors funded by the U.S. Office 
of Education were asked to react to a short inventory questioning 
them about their attitudes toward multiple evaluations. Three 
hundred and twenty-five project directors responded to the 
instrument, which was contained in the final portion of an evaluation 
questionnaire being given by the authors. The evaluation itself 
was part of a larger project being conducted at that time (Ahmann, 
Loomis, and Kelley, 1971). The reactivity instrument was 
made up of the 13 items displayed in Table 1. 

In general, the items were designed to measure possible effects 
of multiple evaluation on project personnel morale, data 
interpretability, and evaluation design. No attempt was made to 
disguise the nature of the questions, and a straightforward direct 
measurement instrument was used. Of the directors sampled, 
approximately one-half indicated that their project was undergoing 
more than one evaluation. It did not seem unusual for projects to 
have more than one evaluation, usually being conducted 
simultaneously by different evaluation groups. Directors were instructed 
to respond to the 13 items by checking the appropriate response 
on a 4-point Likert scale ranging from strongly agree to strongly 
disagree. The results of the survey are also contained in Table 1, 
expressed as the percentage of responses made to a given point on 
the Likert scale. An average of 14 per cent of the respondents did 
not answer a given item. This omission rate accounts for the 
discrepancies between the total of the four percentages and a 
theoretical 100 per cent, which can be seen from inspecting these 
data. 


TABLE 1 
Response Percentages for Project Directors to Multiple Evaluation Questions 
(SA = strongly agree, A = agree, D = disagree, SD = strongly disagree) 

                                                          SA    A    D   SD 

Project Staff Morale 

 1. Multiple evaluations have little effect upon the 
    morale of the project personnel.                       7   22   49   15 
 2. Multiple evaluations increase the unwillingness 
    of project personnel to engage in subsequent new 
    projects.                          [partly illegible: 12, 12, 33, 85] 
 3. Multiple evaluations contribute to the loss of 
    confidence in evaluation generally.                   11   42   34    3 
 4. Multiple evaluations have no effect on cooperative- 
    ness of project personnel toward evaluators.           2   22   54   13 
 5. Multiple evaluations tend to create undue levels 
    of negative emotions (e.g., hostility or anxiety) 
    in project personnel.                                 11   50   27    3 
 6. Because of multiple evaluations people are quitting 
    projects whenever possible.                            3   14   [illegible]   8 

Data Interpretability 

 7. When multiple evaluations occur almost concurrently 
    the earlier ones rarely contaminate the data in 
    later ones.                                           [illegible] 
 8. Multiple evaluations do not affect the interpret- 
    ability of the data they produce.      [partly illegible: 4, 17, 12] 
 9. Multiple evaluations have definite negative 
    effects on the performance of projects.                5   34   41    4 
10. Data from multiple evaluations are more difficult 
    to analyze than data from a single evaluation.         6   43   30    2 

Evaluation Design 

11. Evaluation studies should be built into program 
    (e.g. Special Education, Educational Media, etc.) 
    planning in such a manner as to avoid multiple 
    evaluations.                                          [illegible] 
12. The existence of multiple evaluations increases 
    the importance of unobtrusive (indirect) measures 
    for gathering data.                       [partly illegible: 22, 9] 
13. If spaced far enough apart, multiple evaluations 
    do not have interactive effects.      [partly illegible: 18, 36, 2] 
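The relation between the printed percentages and the omission rate can be checked by simple arithmetic. The sketch below is only an illustration: it uses the five Table 1 items whose four percentages are all legible in this copy, and the dictionary and function names are assumptions introduced here, not part of the original study.

```python
# Percentages (SA, A, D, SD) for the Table 1 items whose four values are
# all legible in this copy; each row is copied from the table as printed.
table1 = {
    1: (7, 22, 49, 15),
    3: (11, 42, 34, 3),
    4: (2, 22, 54, 13),
    9: (5, 34, 41, 4),
    10: (6, 43, 30, 2),
}


def omission_rate(percentages):
    """Per cent of respondents who left the item unanswered."""
    return 100 - sum(percentages)


omitted = {item: omission_rate(row) for item, row in table1.items()}
for item, rate in omitted.items():
    print(f"item {item}: {rate} per cent omitted")
```

Because only a subset of the 13 items is used, the mean of these five omission rates need not equal the 14 per cent average reported for the full instrument.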

Findings. Project directors felt that multiple evaluations had 
some effect on morale and data interpretability, as well as im- 
plications for evaluation designs and their validity. About half 
the directors disagreed with the statement that multiple evalua- 
tions had little effect on morale of project personnel. In general, 
morale problems were seen as affecting both confidence in evalua- 
tion generally and the cooperativeness of project personnel to- 
wards evaluators. In addition, multiple evaluations were perceived 
to create undue levels of negative emotions. While multiple evalu- 
ations seemed to create some morale effects they were not viewed 
as a major cause for people quitting a project nor as creating 
unwillingness of project personnel to engage in new projects. 

Most project directors felt that multiple evaluations do affect 
the interpretability of data that they produce, and that multiple 
evaluations are more difficult to analyze than single evaluations. 
Even if the evaluations occur almost concurrently, there was still 
the feeling that the earlier ones contaminate data in later ones. 

There was a slight tendency to believe that multiple evaluations 
do not have negative effects on the performance of the projects. 
Delineating more carefully possible negative effects could be a 
feasible future project in this area. It is possible that the present 
question of the investigation was too ambiguous to get a clear 
reading as to effects of multiple evaluation on project performance. 

In terms of effects on evaluation design there was definite 
agreement that evaluation studies should be built into a program in 
such a manner as to avoid multiple evaluations. In conversations 
with some project directors the investigators observed that 
directors were concerned with multiple evaluations being performed 
by different groups and thereby contributing to an atmosphere of 
confusion of purpose. This perception would be consistent with the 
reaction to item number 11. There was also general agreement that 
multiple evaluations increase the importance of indirect measures 
for gathering data. Responses to the final question indicated that 
even if spaced far enough apart multiple evaluations do have an 
interactive effect. This perception is consistent with the response 
that contamination may still occur between evaluations even if they 
are made almost concurrently. 

Another trend noticed in these data was for project directors 
with more than two or three evaluations to respond with even 
greater percentages in the directions noted above. This trend held 
true for 10 of the 13 items. Thus, as the number of evaluations 
increased, project directors tended to feel even more strongly in the 
direction already indicated. The most dramatic shift as a function 
of the number of evaluations was for the last item concerning 
spacing multiple evaluations. In the case of those directors 
experiencing four or more evaluations, somewhat more than 40 per 
cent indicated that they strongly disagreed with the item. (Thirteen 
per cent of the directors were undergoing four or more 
evaluations.) 

Discussion. From these descriptive data it can be seen that 
multiple evaluations do have reactive effects on the project directors 
undergoing them. The full extent of these effects on the validity of 
evaluation studies needs to be studied further. Two design strategies 
for dealing with reactivity to multiple evaluation were supported 
by the project directors sampled. One strategy would be to 
emphasize unobtrusive or indirect kinds of measures. There are 
obvious disadvantages to this strategy, including the ethical problems 
in taking measurements without respondents being aware and in 
developing valid hidden measures. The other strategy would be to 
work out designs that would incorporate evaluations without multiple 
measures being taken. Some of the designs indicated in the 
Campbell and Stanley (1963) paper could be useful here in improving 
the validity of evaluation studies. The simple but effective strategy 
of spreading a pool of respondents into random groups and of 
sampling subgroups of this pool at different times is also a way 
of working around repeated measures and of minimizing sources 
of invalidity. 

There are also implications for basic policy decisions regarding 
the number of different evaluation groups or teams that might be 
deployed on a single project. If evaluation were incorporated into 
the plan of the project at an early stage it might be possible to do 
away with some of the evaluation teams. However, whatever 
strategy or design is used to deal with the problem of multiple 
evaluation and subject reactivity, the whole topic needs to be better 
understood. It is quite possible that there should be minimal use 
of deception or indirect measurement and that investigators should 
take the subject into their confidence and explain what it is they are 
trying to assess. Such an open strategy could be used both for 
learning more about subject reactivity and for the actual 
gathering of the evaluation data. 

This strategy might also improve the validity of evaluation 
studies. In any event, the alternative strategies proposed should be 
carefully explored and compared with one another in relation to 
the extent to which they can maximize the validity of evaluation 
studies. Admittedly the criteria of effectiveness needed to establish 
the validity of evaluation studies are difficult to identify espe- 
cially if circularity in reasoning is to be avoided. The task of 
minimizing reactive effects on subjects requires systematic efforts 
on the part of research specialists who are endeavoring to improve 
the internal and external validity of evaluation studies. 


USING LINEAR PROGRAMMING FOR PREDICTING 
STUDENT PERFORMANCE 


JAMES P. IGNIZIO 


This paper discusses a technique for using College Entrance 
Examination Board (CEEB) tests (or results of other entrance 
examinations) in the prediction of the future performance of a 
college student. The procedure is illustrated by the results of an 
actual study which took place at Virginia Polytechnic Institute 
and State University (VPI and SU) of Blacksburg, Virginia. 

One well known and often utilized method employed in such 
analyses is the method of least squares (Neville and Kennedy, 
1966; Freund, 1967). Most previous efforts in this area have used 
this technique; however, it suffers from several disadvantages, 
two of which may be noted: 


1. The method of least squares is biased by extreme cases. The 
   method employed within this paper pays less attention to 
   these “outliers,” since it does not square the residuals (i.e., 
   differences between measures of actual student performance 
   and those of predicted performance). 

2. Sensitivity analysis on the data generated by the method of 
   least squares is relatively difficult and time consuming. The 
   method employed in this study, on the other hand, can readily 
   perform comprehensive analyses on the consequences of any 
   changes (e.g., additions and deletions) of the variables 
   involved. 



The technique used in this paper to generate predicting equations 
(geometrically represented by a hyperplane of “best” fit) is 
the well established technique of linear programming (Converse, 
1970; Thompson, 1971). Linear programming, which is a widely 
used mathematical tool, is almost always available as a part of 
the library of existing computer systems. The technique can be 
used to solve systems involving up to several thousands of 
variables (student scores in the particular study under consideration). 

In this study, the usual technique of least squares was applied 
and compared with the results obtained by linear programming. 
The remainder of this paper discusses (briefly) these two methods 
and then describes and compares, in detail, the results obtained 
via each method. 

Formulation of the problem. Fifty students in the top tenth of 
their high school graduating class were used as a sample because 
that is the group from which VPI and SU honor students are 
selected. Moreover, since it was believed that high school rank is 
possibly the most significant factor in predicting college 
performance, it was desired to hold this factor constant in order to 
concentrate on the pertinence of CEEB scores. These students entered 
VPI and SU in the fall of 1964. The Mathematics and Verbal 
portions of the SAT were averaged, their CEEB achievement 
tests (Mathematics and English Composition) were averaged, and 
their quality credit average (grade average) for their freshman 
year was recorded. Let these variables be designated as $a_1$, $a_2$, 
and $Q_A$, respectively. 

The problem then was to develop an equation which “best” 
predicts the expected grade average of a student when given the 
values of $a_1$ and $a_2$. Alternately, this problem is to develop a plane 
of best fit, given by the equation: 

$$Q_P = a_1 x_1 + a_2 x_2 + b \qquad (1)$$

where: 

$Q_P$ = the predicted grade average 
$a_1$ = the average of the CEEB SAT 
$a_2$ = the average of the CEEB achievement tests 

and $x_1$, $x_2$ and $b$ are to be determined. 

Once a method has been found to determine the values of $x_1$, $x_2$ 
and $b$, equation (1) can be used to predict the scores of any future 
sample of students given the results of their College Board scores. 

The least squares method. The objective of the least squares 
technique is to minimize the sum of the squares of the residuals 
($r_i^2$), that is 

$$\text{minimize} \sum_{i=1}^{n} r_i^2$$

Since methods for accomplishing this objective are well known 
(Freund, 1967), they will not be discussed here. For the sample 
of 50 students studied, least squares produced the following 
equation: 

$$Q_P = -0.0335\,a_1 + 0.0421\,a_2 + 2.31 \qquad (2)$$

Consequently, using this result, one could predict the 
performance of a new student as follows: 

Example: 

Suppose that a new student achieves a 45 on the aptitude test 
and a 50 on the achievement test. His predicted grade average 
at the end of his freshman year would be: 

$$Q_P = -0.0335(45) + 0.0421(50) + 2.31 = 2.91 \qquad (3)$$

(where A = 4.00). 
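The arithmetic of the example can be reproduced directly from the coefficients of equation (2). The following is a minimal sketch; the function name is an assumption introduced here for illustration.

```python
def predict_least_squares(a1, a2):
    """Equation (2): predicted freshman grade average (A = 4.00).

    a1 is the averaged SAT score and a2 the averaged achievement score,
    on the same scale as the worked example in the text.
    """
    return -0.0335 * a1 + 0.0421 * a2 + 2.31


example = predict_least_squares(45, 50)
print(round(example, 2))
```

The unrounded value is 2.9075, which rounds to the 2.91 reported in equation (3).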

The linear programming method. Since, as mentioned previously, 
the least squares method is biased by extremes, one can restate 
the objective as to minimize the sum of absolute deviations 
($|r_i|$) rather than the sum of the squares of the deviations ($r_i^2$). 
The predicting equation based on this objective will then pay 
less attention to extreme cases. 

The actual grade average ($Q_A$) for each student is equal to the 
predicted average ($Q_P$) plus the residual ($r$), or for “n” students: 

$$\begin{aligned}
Q_{P1} + r_1 &= Q_{A1} \\
Q_{P2} + r_2 &= Q_{A2} \\
&\;\;\vdots \\
Q_{Pn} + r_n &= Q_{An}
\end{aligned} \qquad (4)$$

where: 

$Q_{Pj}$ = the predicted grade for student “j” 
$Q_{Aj}$ = the actual grade for student “j”. 

The linear programming problem can then be stated as: 

$$\text{Minimize } f = |r_1| + |r_2| + \cdots + |r_n| \qquad (5)$$

subject to the system of equations in (4), above. 

Equations (4) and (5) are not yet in the standard form 
required for solution by linear programming. However, Converse 
(1970) shows how one may easily transform the problem into 
standard form via simple transformations. Solving the transformed 
problem by linear programming we obtain the new predicting 
equation: 

$$Q_P = -0.0179\,a_1 + 0.0520\,a_2 + 0.905 \qquad (6)$$

Using the previous example of a student with $a_1 = 45$ and 
$a_2 = 50$, we obtain a predicted grade of $Q_P = 2.70$, when 
substituting into equation (6) above. 
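For readers who want to see what such a transformation can look like, the block below sketches the usual textbook device of splitting each residual into two nonnegative parts. It is offered as an assumption about the general approach, not as a reproduction of Converse's derivation; the symbols $u_j$, $v_j$, $a_{1j}$ and $a_{2j}$ (the two averaged scores of student $j$) are notation introduced here. The last line simply repeats the substitution into equation (6).

```latex
% Split each residual: r_j = u_j - v_j with u_j, v_j >= 0.
\begin{aligned}
\text{minimize}\quad & \sum_{j=1}^{n} (u_j + v_j) \\
\text{subject to}\quad & a_{1j}\,x_1 + a_{2j}\,x_2 + b + u_j - v_j = Q_{Aj},
  \qquad j = 1, \dots, n, \\
& u_j \ge 0,\quad v_j \ge 0 .
\end{aligned}

% At an optimum at most one of u_j, v_j is nonzero, so u_j + v_j = |r_j|.
% If the solver also requires nonnegative unknowns, x_1, x_2 and b can each
% be written as the difference of two nonnegative variables.

% Worked substitution into equation (6):
Q_P = -0.0179(45) + 0.0520(50) + 0.905 = 2.6995 \approx 2.70
```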

Observations. For the original sample of fifty students in 1964, 
the absolute sum of all 50 residuals for the least squares case is 
24.65 versus 23.61 for linear programming. Consequently, 
equation (6), the prediction equation obtained via linear programming, 
can be seen to pay less attention to extreme cases than does 
equation (2), the one based on least squares prediction. 
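Why an absolute-deviation criterion reacts less to extreme cases can be seen in the simplest possible setting, a prediction equation consisting of a constant only. In that case the least squares fit is the mean and the least absolute deviations fit is the median. The sketch below uses hypothetical grade averages, not data from the study, and the helper names are assumptions made for illustration.

```python
from statistics import mean, median


def sum_abs(center, values):
    """Sum of absolute residuals around a constant prediction."""
    return sum(abs(v - center) for v in values)


def sum_sq(center, values):
    """Sum of squared residuals around a constant prediction."""
    return sum((v - center) ** 2 for v in values)


# Hypothetical grade averages (not from the study); the last is an outlier.
grades = [2.8, 2.9, 3.0, 3.1, 3.2, 0.9]

ls_fit = mean(grades)     # minimizes the sum of squared residuals
lad_fit = median(grades)  # minimizes the sum of absolute residuals

print("least squares constant:", round(ls_fit, 2))
print("absolute deviation constant:", round(lad_fit, 2))
print("sum |r| at each fit:",
      round(sum_abs(ls_fit, grades), 2), round(sum_abs(lad_fit, grades), 2))
```

The single low value pulls the mean well below the bulk of the grades, while the median stays among them; each fit is best only on its own criterion.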

With regard to the actual test results, it is very interesting to 
observe that the coefficient of the average CEEB SAT score is 
negative. Although the sample is small and restricted to high 
ability students, this negative weight implies that an inverse 
relation appears to exist between average CEEB SAT scores and 
later academic performance. Such an observation warrants further 
attention. 

To validate the prediction equations (equations 2 and 6), a 
more recent sample of VPI and SU freshman students was 
investigated. One hundred and eleven students (again in the upper 
tenth of their high school graduating class) entering VPI and SU 
in the fall of 1970 were sampled. Equations (2) and (6), derived 
from the original sample of VPI and SU freshman students 
(1964), were used to predict the grade average for this new 
sample group. The results once again showed that the linear 
programming prediction equation gave the smaller absolute sum of 
the 111 residuals (53.53 versus 57.26 for least squares). Also, 
once again the inverse relation of SAT scores appeared. In addition, 
there was no change in the efficiency of the linear programming 
prediction between the 1964 and 1970 samples. 

Summary. The results of this test study strongly indicate that 
the generation of a prediction equation for future student perform- 
ance can be achieved easily by the readily available method of 
linear programming. Further, the predictions, using this equation, 
will be far less biased by extreme cases and thus should offer a 
more valid estimate of expected student performance than that 
given by a least squares approach. 


QUANTITATIVE FACTOR SCORES AS PREDICTORS 
OF GENERAL ACADEMIC PROMISE 


PAUL S. BURNHAM AND BENJAMIN A. HEWITT 
Yale University 


Scores on pre-college mathematics achievement tests have tradi- 
tionally been regarded as offering evidence of readiness to pursue 
work chiefly in mathematics and the quantitative sciences. Re- 
cently, however, we have seen evidence that scores on the College 
Entrance Examination Board (CEEB) Mathematics Achievement 
Tests and on the Graduate Record Examination Quantitative 
(GRE-Q) section offer important information predictive of gen- 
eral academic success at the undergraduate college and graduate 
school levels respectively. 

Group comparisons. In an unpublished pilot study by R. R. 
Ramsey, Jr., formerly Director of Undergraduate Admissions at 
Yale, the numerical grade averages for the first semester of 
freshman year (Yale College Class of 1969) were distributed and the 
top two per cent and the bottom two per cent were identified. 
Twenty-two cases comprised each group. A summary of the 
CEEB scores of these two contrasting groups is shown in Table 1. 
While desirably the number of cases should be larger, neverthe- 
less considerable evidence of differentiation is found in the mean 
differences and in the comparative distributions of the high (H) 
and low (L) groups. The H group exceeded the L group by 42, 61, 
and 29 mean score points on the CEEB Scholastic Aptitude Test 
—Verbal and Mathematics sections (SAT-V, SAT-M) and the 
English Achievement Tests, respectively. 

In noting the scores and the mean differences on the Mathe- 
matics Achievement Tests, one should recall this as the transition 
year in the CEEB’s Mathematics Testing Program from an 
Intermediate and an Advanced Test, to a so-called Standard 
(Level I) and an Intensive (Level II) examination. Two items of 
pertinent information stand out: (1) since only five of the L group 
but 11 of the H group took the Advanced Test, self-selection 
tended to identify the potentially superior student; and (2) this 
phenomenon of self-selection was further implied by the numbers 
taking the lower level (Standard) test: twelve of the L group 
contrasted with only four of the H group. The fact that the Chem- 
istry examination (not shown in Table 1) was elected by nine 
of the 22 High and by only four of the Low group students is 
another interesting reflection of self-selection. 

More recent pilot study information is presented in Table 2 
for the Class of 1972, which matriculated after the Yale College 
faculty had changed the grading system from a numerical to an 
honors, high pass, pass, and fail pattern. Small H and L 
performance groups were identified by determining from their grades 
for both terms of freshman year the 10 students who had earned 
the largest number of honors grades and the 10 who had the 
most failures. Superiority of the H over the L group is again 
apparent in the conspicuous mean differences in the SAT-V, 
SAT-M, and English Achievement Test scores. Again, self-selec- 
tion with respect to Mathematics Achievement scores seems to 
characterize the students who comprised the high group: eight 
of the H group but only one of the L group took the Intensive 
Mathematics Achievement Test. On the other hand, five of the 
H group and 10 of the L group took the Standard Mathematics 
examination with the mean difference in scores being 167. 

As in Ramsey's data, evidence of self-selection is also apparent 
in the avoidance of Achievement tests in the Sciences. Eight of 
the High group but only two of the Low group offered scores in 
either Chemistry or Physics. 

Correlation evidence. To explore more fully the relationship 
of general college achievement to admission test scores, we 
correlated the number of Honors grades earned by each member of 
the Class of 1972 with his scores on (a) the SAT-V, (b) the 
Intensive and (c) the Advanced Mathematics Achievement Tests, 
and (d) a single combined index derived from his scores on the 
SAT-V, SAT-M, and his CEEB Achievement Tests. Results are 
shown in Table 3. 

Despite the restricted range of CEEB scores, correlations 
between these measures and number of Honors grades proved to 
be statistically significant and thus supported the bi-polar group 
comparisons shown in Table 2. Particularly noteworthy is the 
finding that scores on the Intensive Mathematics Achievement 
Tests, despite marked restriction in range, correlated nearly as 
well as did the SAT-V with our criterion of general academic 
success (coefficients of .30 and .32). If all 972 who took the 
SAT-V had also elected the Intensive Mathematics test the validity 
coefficient might well have been in the range .70-.75, the estimate 
obtained by correcting for restriction in range. Furthermore, while 
the average freshman achieved 2.2 honors grades, those who took 
the Intensive Mathematics test achieved 2.7 honors grades, an 
outcome thus showing further evidence of self-selection. 
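The text does not say which correction for restriction in range was applied. A common choice is the standard formula for direct selection on the predictor, sketched below under that assumption. The function name and the standard-deviation ratio of 3.0 are purely illustrative; the ratio actually used by the authors is not reported.

```python
import math


def correct_for_range_restriction(r, sd_ratio):
    """Estimate the unrestricted correlation from a restricted one.

    r        -- correlation observed in the restricted group
    sd_ratio -- unrestricted SD of the predictor divided by its restricted SD
    """
    return r * sd_ratio / math.sqrt(1 - r ** 2 + (r * sd_ratio) ** 2)


observed = 0.30  # Intensive Mathematics vs. honors grades, as reported
print(round(correct_for_range_restriction(observed, 1.0), 2))  # no restriction
print(round(correct_for_range_restriction(observed, 3.0), 2))  # illustrative ratio
```

With no restriction (ratio 1.0) the formula returns the observed coefficient unchanged; larger ratios raise the estimate, which is the direction of the adjustment the authors describe.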

At other institutions. Important implications derive from relat- 
ing these findings to published data from other institutions. The 
California Institute of Technology (CIT), which required all 
applicants to take the Advanced Mathematics Achievement Ex- 
amination, reported in the Manual of Freshman Class Profiles 
(CEEB) certain score distributions of their matriculants. Ninety- 
nine per cent of their scores on the Advanced Mathematics Tests 
were in the 700-800 range; additionally, 38 per cent of CIT 
matriculants scored 700-800 on the SAT-V while the correspond- 
ing percentages for Harvard and Yale were 30 and 31 respectively. 
Similarly, Massachusetts Institute of Technology’s requirement 
of a Mathematics Achievement score did not prevent recruitment 
of an otherwise able group of matriculants as indicated by the 
fact that some 36 per cent had SAT-V scores in the 700-800 
range. These data are summarized in Table 4, where they are 
compared with the corresponding Harvard and Yale statistics. 


TABLE 3 
Correlation of CEEB Scores with Number of Honors Grades Earned 
During Freshman Year 
(Class of 1972 Yale College) 

                                 Correlations       Means and Standard Deviations 
                                                       CEEB            Honors 
College Board Examinations        N       r          M      SD        M      SD 

SAT-V                            972     .32         68     18       2.2    2.5 
Combined Aptitude and 
  Achievement Tests              972     .41        684     57       2.2    2.5 
Math. Achievement 
  Standard (I)                   612     .26        685     75       1.9    2.4 
  Intensive (II)                 336     .30        750     55       2.7    [illegible] 

Note. Since 112 students took both levels of the Mathematics Achievement 
Test, the number of students [words illegible] Mathematics was 836. 


Thus, the CIT requirement that all applicants present scores on 
Level II of the Mathematics Achievement Test has not prevented 
the recruitment of a class talented in both verbal aptitude and 
mathematics achievement. On the other hand, the Harvard and 
Yale permissive system of allowing each applicant to elect achieve- 
ment tests of his choice and to avoid an examination in mathematics, 
seems to deprive the college of the optimal assessment of the 
academic potential of its applicants. Only one third of Yale’s 
recent matriculants typically offer Level II Mathematics Achieve- 
ment scores; comparable Harvard data are not available. 
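Several figures in Table 4 can be cross-checked from the counts it reports. The sketch below recomputes the matriculant-to-admitted percentages and the size of the Yale group without an Advanced Mathematics score; the dictionary layout and variable names are assumptions made for illustration.

```python
# Counts as printed in Table 4 (Class of 1968).
table4 = {
    "CIT": {"admitted": 310, "matriculants": 204},
    "Harvard": {"admitted": 1438, "matriculants": 1201},
    "Yale": {"admitted": 1547, "matriculants": 1061},
}

# Matriculants as a percentage of those admitted, rounded as in the table.
yield_pct = {
    name: round(100 * counts["matriculants"] / counts["admitted"])
    for name, counts in table4.items()
}
print(yield_pct)

# Yale matriculants in the "<500 or not available" mathematics category:
# 40.5 per cent of 1061 matriculants.
yale_no_math = round(0.405 * table4["Yale"]["matriculants"])
print(yale_no_math)
```

The recomputed percentages agree with the 66, 84 and 69 per cent printed in the table, and the Yale count agrees with the 430 matriculants mentioned in its footnote.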

The graduate level. The importance of quantitative reasoning 
skill in achieving the PhD degree is implied by the data in Table 
5. Mean GRE-V (Graduate Record Examination—Verbal section) 
and GRE-Q scores of 1337 Yale Graduate School Matriculants 
1955-1959 are compared with the mean scores of those who 
succeeded in subsequently earning the PhD degree. The Verbal 
means of PhD recipients differ negligibly from those of all 
registrants; on the other hand, the Quantitative mean of all PhD 
recipients was 15 points higher than that of the registrants. 
Within the divisions, comparable increases ranged from a low 
of three points in the Biological Sciences to a maximum of 24 
points in the Social Sciences. These consistent differences are 
small, but they may be important. 


TABLE 4 
College Board Scores of Matriculants to the Class of 1968 

                California Institute 
                   of Technology            Harvard                Yale 
                          Adv. Math.             Adv. Math.             Adv. Math. 
                SAT-V       Ach.       SAT-V       Ach.       SAT-V       Ach. 
Score             %          %           %          %           %          % 

700-800         38.2       99.0        29.5        NA(a)      30.8       39.6 
600-699         51.0        1.0        44.2        NA         51.9       17.4 
500-599          9.8                   24.5        NA         15.2        2.5 
<500 or not 
  available      1.0                    1.8        NA          2.1       40.5(b) 
Total          100.0      100.0       100.0                  100.0      100.0 

Applicants            1213                  5643                   6041 
Admitted               310                  1438                   1547 
Matriculants           204                  1201                   1061 
Matric./Adm.           66%                   84%                    69% 

(a) NA: Data not available. 
(b) Among the 430 matriculants in this category, only two had taken the 
Advanced Mathematics Examination. 

Concluding statement. Those who have regarded Mathematics 
Achievement scores as chiefly relevant to college work in mathe- 
matics and the quantitative sciences should take a fresh look at 
the scores and subsequent college achievement of students. The 
evidence suggests the importance of Mathematics Achievement 
scores as predictors of general academic success at the under- 
graduate level. The applicant who scores high in both verbal apti- 
tude and mathematics achievement is likely to be the high-level 
generalist who is really free to choose among the many areas 
of study. He can select on the basis of positive interest rather 
than avoidance of difficulty. In addition to sheer competence, it may 
be that a high mathematical achievement score suggests certain 
desirable personal qualities, such as analytical skill, perseverance, 
or well organized behavior patterns which are distinct assets in all 
fields. 

Admission policies permitting candidates to select freely the 
achievement tests of their choice tend to discriminate against 
candidates who elect tests in difficult areas. By permitting the 
weak applicant to conceal the extent of his weakness such policies 
can do a disservice to the college seeking to select the best students 
in a competitive situation. 


TABLE 5 
Mean GRE-V and GRE-Q Scores of Yale Graduate School Registrants for 
the PhD Degree 1956-1959 and of those Subsequently Awarded the 
PhD Degree, Classified by Division of Study 

                            PhD Registrants                PhD Recipients 
                                                       % of 
Division of Study          N      GRE-V   GRE-Q     Registrants  GRE-V   GRE-Q 

Humanities            [illegible] [illegible] 579   [illegible] 
Biological Sciences       200      641     687      [illegible] 
Physical Sciences         283      628     705      [illegible] 
Social Sciences           228      660     624      [illegible] 
Total Group              1337      661     622          62      [illegible] 
Scores Not Available  [illegible] 
Total PhD Registrants    1748 

Note. Among total registrants 60 per cent earned the PhD while among 
those for whom GRE scores were available the comparable figure was 
62 per cent. 

Additionally, colleges and universities might properly require 
achievement tests in mathematics and possibly the sciences if they 
are intent on selecting a freshman class composed of students who 
are in fact free to choose their prospective majors on the basis of 
positive interest. Much is heard about students from minority and 
foreign groups who are penalized by tests of verbal skills. While 
we know all too little about the mathematical handicap of minority 
groups, we are familiar with the many instances of foreign students 
whose mathematical scores are high despite their low performance 
on verbal tests. 

At both the undergraduate and graduate levels, the data imply 
the desirability of considerable further research to explore more 
comprehensively the importance of quantitative abilities in 
predicting successful completion of the BA and PhD degrees. 


CONTRIBUTIONS OF SELECTED TRANSCRIPT 
INFORMATION TO PREDICTION OF LAW 
SCHOOL PERFORMANCE 


RICHARD R. REILLY 
Educational Testing Service, Princeton 


It is not surprising that past academic performance has often 
been found to be the best single predictor of future academic 
performance. Admissions offices in graduate and professional 
schools have long recognized this fact, and as a result virtually 
all schools require complete records of previous performance in the 
form of transcripts. The use then made of this transcript infor- 
mation may depend upon individual admissions offices, but judging 
from statements in college and graduate school catalogues and 
from most published prediction studies one overall index of in- 
dividual performance such as rank-in-class or GPA is given heavy 
weight in admissions decisions, while more specific information is 
often largely ignored. It seems plausible, however, that a more 
careful breakdown of the undergraduate record might lead to in- 
creases in predictive accuracy. This potentiality may be especially 
true in the professional and graduate school settings where specific 
groups of undergraduate courses can be judged as being more or
less relevant to graduate study in a given area. On a conceptual
level, at least, grades in undergraduate biology courses should
be more highly related to medical and dental school studies than 
grades in, say, English literature. The usual cumulative GPA, of 
course, does not include any a priori weighting of subjects with 
respect to their relevance for any particular field, but for most 
graduate and professional fields a number of specific hypotheses 
could be generated as to which courses or items of information 
might be most important or relevant. 
Purpose. Although the present study was exploratory in nature,
its major purpose was to investigate whether any increase in prediction
of law school performance could be effected by a more
thorough consideration of the undergraduate record. It was also 
hoped that this study would suggest other hypotheses for future 
research. 

Method. The sample consisted of 134 first-year law school 
students from school A and 85 first-year students from school 
B. The following variables were extracted from the students’ undergraduate
transcripts.

The first five variables were dummy variables denoting a specific
category. Students who fell into the category were given a 1;
students who did not, a 0.

1. Major in Humanities (Maj Hum) included all students
majoring in English, languages, philosophy, theology, speech,
dramatics, or related subjects.

2. Major in Social Sciences (Maj SS) included students majoring
in economics, history, political science, business administration,
geography, sociology, anthropology, or related subjects.

3. Major in Science¹ (Maj Sci) included all students majoring
in physics, chemistry, biology, psychology, geology, or related
fields.

4. Changed Major (Cha Maj) included all students who changed 
their major at least once during their undergraduate careers.

5. Year Graduated (YG): all students graduating in a year 
earlier than 1969 were given a 1 on this variable. 

The next nine variables were based on grades in specific courses
or years. Since the undergraduate colleges involved employed a
variety of grade scales, all grades were converted to a 0-4 (low-high)
scale for study purposes.


6. Cumulative GPA for four years (GPA).

7. Average GPA in Humanities (Hum GPA) (i.e., average
grade in all courses falling into the area described in variable 1).

8. Average GPA in Social Sciences (SS GPA) (i.e., average
grade in all courses falling into the areas described in variable 2).


¹A fourth major group was Quantitative and Technical, which included
... computer sciences or related fields. Students falling into this
fourth group were identified by zeros on the first three dummy variables.
9. Average GPA in Sciences (Sci GPA) (i.e., average grade in
all courses falling into the areas described in variable 3).

10. Average GPA in Quantitative and Technical (QT GPA) (i.e.,
average grade in all courses falling into the areas described in
footnote 1).

11. Average in Major Subject (Maj GPA) (i.e., average over
all years for courses in major subject).

12. Third-year GPA minus first-year GPA ((3-1) GPA).

13. First-year GPA (1 yr GPA).

14. Second-year GPA² (2 yr GPA).

The next set of transcript variables consisted of five product
terms where in each case one factor was a dummy variable described
earlier and the other factor was a quantitative variable.
The final “transcript” variable included was the mean LSAT score
of all candidates taking the LSAT during 1968-1970 who attended
the college from which the transcript was received. This was intended
to serve as a very rough indicator of school quality.

15. Variables 1 × 6 (Hum × GPA).
16. Variables 2 × 6 (SS × GPA).
17. Variables 3 × 6 (Sci × GPA).
18. Variables 5 × 6 (YG × GPA).
19. Variables 4 × 12 (Cha Maj × (3-1) GPA).
20. College LSAT Mean (LSAT-M). 


Two additional independent variables, Law School Admission
Test (LSAT) score and Writing Ability (WA) scores, were included
in the analyses. First-year law average (FYA) served as the
dependent variable in both schools.

Two points related to the selection and combination of transcript
information for purposes of this study should be clarified.
First, it should be recognized that the major subject categories,
which are somewhat arbitrary, certainly should not be taken to
reflect any rigid preconceptions held by the author as to the interests,
aptitudes, or abilities called for by each. Actually, the
categories are quite similar to those used by Cartter (1966) in
his study of academic quality of graduate schools, except that his
two categories of biological and physical sciences were combined
into one science category and mathematics and accounting were
classed with engineering in a quantitative and technical category.

²Third-year GPA was not included as a separate variable for the same
reason a fourth dummy variable was not needed to denote majors in quantitative
and technical areas. I.e., the information would have been redundant and
would also have made the data matrix singular.

A second point concerns the inclusion of the five product term
variables. Since the planned mode of analysis was that of multiple
regression, it was decided to use a form of polynomial regression
to allow for the possibility that different regression slopes might
be required for individuals in different groups for certain predictor
variables. It may be helpful for the reader to note that the
results obtained when such terms are entered in a multiple regression
format are similar to the results of a test of equality of
slopes by means of analysis of covariance and can, in fact, be
made directly equivalent to the latter (Cohen, 1968). Direct
equivalence was not the case in the present study, since all of the
cross-product terms were entered in with all other variables in a
stepwise regression procedure. Retention of one or more of these
terms by the stepwise procedure would suggest that group membership
might serve as a moderator variable (Saunders, 1956). A
complete model for studying the moderating effects of group membership
on prediction would have meant including every possible
cross-product of the dummy variables with the continuous predictor
variables. In the present case this would clearly have resulted
in an unwieldy number of predictors. For this reason it was
decided to limit the cross-product terms to five of the most
hypothetically tenable.
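The role of a product term can be illustrated with a minimal sketch. It assumes a single dummy variable and a single continuous predictor, fits ordinary least squares rather than the stepwise procedure used in the study, and uses invented data constructed so that the GPA slope is 4 in one group and 6 in the other. The helper names `solve` and `ols` are hypothetical.

```python
"""Moderated regression sketch: a dummy-by-GPA product term lets the
GPA slope differ between two groups. All data are made up."""


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(rows, y):
    """Least-squares weights from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical students: (cumulative GPA, dummy = 1 if graduated before 1969).
students = [(2.0, 0), (2.5, 0), (3.0, 0), (3.5, 0),
            (2.0, 1), (2.5, 1), (3.0, 1), (3.5, 1)]
# Made-up criterion: GPA slope is 4 when dummy = 0 and 6 when dummy = 1.
fya = [60 + 4 * gpa + 2 * d * gpa for gpa, d in students]

# Columns: intercept, GPA, product term (dummy x GPA).
design = [[1.0, gpa, d * gpa] for gpa, d in students]
intercept, b_gpa, b_product = ols(design, fya)
print(round(intercept, 3), round(b_gpa, 3), round(b_product, 3))
```

The weight on the product column is the difference between the two groups' GPA slopes, which is why retention of such a term points to a moderating effect of group membership.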


Results and Discussion 


The intercorrelation matrices³ are shown in Tables 1 and 2. In
Table 2 the year graduated variable and the product of year
graduated and cumulative GPA were not included because all first-year
students at school B graduated in 1969.

³Although third-year average was not included in the original data matrix
for reasons noted earlier, the correlations of all variables with third-year
averages were estimated by the relationship:

$$ r_{i3} = \frac{r_{i1}\sigma_1 + r_{i(3-1)}\sigma_{(3-1)}}{\sigma_3} $$

In both schools the FYA criterion correlated most highly with
LSAT scores, and social science grades correlated more highly with
FYA than did cumulative GPA. One other notable observation
can be made from Tables 1 and 2. In both schools cumulative
GPA was negatively correlated with LSAT and school mean 
LSAT. In fact, with the exception of SS GPA which had a very 
modest positive correlation (.05) with LSAT-M in School B and 
QT GPA with a similarly modest correlation (.02) with LSAT in 
School A, GPA predictors correlated negatively with LSAT-M 
and LSAT in both samples. A similar finding was noted in a recent 
study of psychology graduate students by Hackman, Wiggins, and 
Bass (1970), in which GRE scores were negatively correlated 
with both undergraduate GPA and a subjective quality rating of 
the students’ undergraduate institution. Given a sample of stu- 
dents within a fairly restricted ability range one might expect 
lower GPAs for students from more prestigious institutions with
a resulting negative correlation between the index of school
quality and GPA. Astin (1969) has reported data indicating that 
among undergraduates individuals at a given level of ability 
typically receive lower grades in selective than in unselective 
colleges. The negative correlations between LSAT scores and GPA 
are a bit more puzzling since one would normally expect these 
variables to be positively related. An explanation may be found 
in the procedures used to accept students from the larger appli- 
cant population. If both GPA and LSAT were given roughly equal 
weight in accepting candidates, for example, and most candidates 
with high scores on both variables either did not apply or did 
not choose to come, the resulting sample of accepted students could 
have included a much higher proportion of candidates with discrepant
scores (i.e., high scores on one variable and low scores
on the other) than is true of the general candidate population. 
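The selection argument above can be checked with a small simulation. This is only a sketch under stated assumptions: standardized GPA and LSAT scores correlate .30 in the applicant pool, admission depends on an equally weighted sum, and the strongest candidates enroll elsewhere. The cutoffs and the pool correlation are invented, not estimates from the study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
rho = 0.3  # assumed pool correlation between GPA and LSAT (z-scores)
applicants = []
for _ in range(20000):
    gpa = rng.gauss(0, 1)
    lsat = rho * gpa + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
    applicants.append((gpa, lsat))

# Equal-weight composite: admitted above 1.0, but those above 2.5 go elsewhere.
enrolled = [(gpa, lsat) for gpa, lsat in applicants if 1.0 < gpa + lsat < 2.5]

r_pool = pearson(*zip(*applicants))
r_enrolled = pearson(*zip(*enrolled))
print(f"pool r = {r_pool:.2f}, enrolled r = {r_enrolled:.2f}")
```

Within the enrolled band the composite varies little, so a high score on one variable tends to be paired with a lower score on the other, and the correlation turns negative even though it is positive in the pool.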

The results of the stepwise regression are presented in Table 3
with variables ranked in order of their selection. The stepwise
procedure selected all variables resulting in an increase of at least
.001 in the squared multiple correlation. Because this is a rather
liberal criterion the variables resulting in significant (p < .05)
increments in the squared multiple R have been asterisked. In
both schools the first two variables selected were LSAT and social
sciences GPA, the latter variable barely reaching significance in
school B. The third significant variable resulting in school A was
a “moderator variable,” i.e., the cross-product of the dummy variable
denoting year graduated and cumulative GPA. A test of the
hypothesis of equal regression slopes of FYA on cumulative GPA
in the two groups (i.e., those graduating in 1969 and those graduating
earlier) yielded an F value of 6.706 with 1 and 131 degrees
of freedom, which is significant beyond the .02 level.
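A test of this kind can be expressed as an F ratio on the gain in squared multiple R when the product term is added. The sketch below uses the standard formula for comparing nested regression models. The R-squared values are hypothetical, and only the sample size of 134 comes from the text. A larger model with two predictors gives 134 - 2 - 1 = 131 denominator degrees of freedom, which happens to match the reported value, though the source does not spell out its model.

```python
def f_for_added_terms(r2_full, r2_reduced, n, k_full, q):
    """F ratio for the gain in squared multiple R when q predictors are added.

    n is the sample size and k_full the number of predictors (excluding the
    intercept) in the larger model. Degrees of freedom are q and n - k_full - 1.
    """
    df1 = q
    df2 = n - k_full - 1
    f_ratio = ((r2_full - r2_reduced) / df1) / ((1.0 - r2_full) / df2)
    return f_ratio, df1, df2


# Hypothetical R-squared values; only n = 134 (school A) comes from the text.
f_ratio, df1, df2 = f_for_added_terms(
    r2_full=0.30, r2_reduced=0.25, n=134, k_full=2, q=1)
print(round(f_ratio, 3), df1, df2)
```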

Although the model used here was technically different, this 
finding supports previous evidence that age (which quite clearly 
is highly correlated with year graduated) is a moderator variable 
in the law school (Klein, Rock, and Evans, 1968) and Business 
School (Pitcher and Smith, 1969) settings. The prediction equation
for school A⁴, considering only the significant variables of
Table 3, can be expressed as:

Predicted FYA = .0293 LSAT + 2.8237 SS + .7248 δ GPA +
44.6789, where δ = 1 for individuals graduating before 1969 and 0
otherwise.

It can be seen from this equation that 94 LSAT points have
approximately the same effect on predicted FYA as a unit increase
in social science GPA and that a small positive adjustment
based on GPA is made for pre-1969 graduates. This “adjustment”
factor can be considered in light of the Pitcher and Smith data
which suggested that older students are underpredicted when a
regression equation derived on all students is used.

The equation for school B is: 

Predicted FYA = .0014 LSAT + .2143 SS + 1.2900.

In this case a unit increase in social science GPA has about the
same effect as 153 LSAT scaled score points. It was unfortunate
that the YG × GPA term could not be studied in school B because
of the lack of variation mentioned earlier.
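The two prediction equations can be written out as functions. This is a sketch in which the printed coefficients are read as .0293, 2.8237, .7248 and 44.6789 for school A and .0014, .2143 and 1.2900 for school B. The placement of the decimal points and the 0/1 indicator for pre-1969 graduates are readings of an unclear print, and the applicant is hypothetical.

```python
def predicted_fya_school_a(lsat, ss_gpa, gpa, graduated_before_1969):
    """School A equation as read from the text (coefficients are a reading)."""
    delta = 1 if graduated_before_1969 else 0
    return 0.0293 * lsat + 2.8237 * ss_gpa + 0.7248 * delta * gpa + 44.6789


def predicted_fya_school_b(lsat, ss_gpa):
    """School B equation as read from the text; FYA there is on a 0-4 scale."""
    return 0.0014 * lsat + 0.2143 * ss_gpa + 1.2900


# Hypothetical applicant, not taken from the study data.
a_recent = predicted_fya_school_a(600, 3.0, 3.2, graduated_before_1969=False)
a_earlier = predicted_fya_school_a(600, 3.0, 3.2, graduated_before_1969=True)
b = predicted_fya_school_b(600, 3.0)

# LSAT points that move predicted FYA as much as one SS GPA unit (school B).
lsat_per_ss_unit_b = 0.2143 / 0.0014
print(round(a_recent, 2), round(a_earlier, 2), round(b, 2),
      round(lsat_per_ss_unit_b))
```

Dividing the social science weight by the LSAT weight gives the trade-off quoted for school B.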

The results of a second pair of stepwise regression analyses
from which the test score variables were excluded are shown in
Table 4. Examination of Tables 3 and 4 enables conclusions to be
drawn with respect to some of the implicit hypotheses underlying
the selection of transcript variables for the study. First, the major
subject studied by students does not appear to be useful information
for predicting FYA. In the present study the majority of
students majored in social sciences, and consequently there was
little variation on each of the three dummy variables denoting
major subject area. The possibility exists, however, that some other
system of polychotomization might have produced more positive
results.

⁴FYA for school A ranged from 50 to 80, while school B operated
with the more common 0-4 scale.

None of the three yearly grade averages appeared to be a better
predictor of FYA than cumulative GPA, and the degree of improvement
shown by a student from the first to third years of
college also failed to add much to prediction. On the other hand,
breaking GPA down by subject area does appear to be potentially
useful. Social science grades, in particular, appear to show promise
as a predictor and should be examined in further research. Of
the cross-product terms, only one, YG × GPA, was included
among the variables; it added significantly to the squared multiple
R, and this result has been discussed above.

It is evident from Tables 3 and 4 that cumulative GPA is not
among the more prominent contributors to prediction and that
social science GPA appears to be the single best grade variable
for predicting FYA. This result, as well as all other results reported
in this paper, should be interpreted with caution, however.
The real usefulness of any of the variables studied cannot be
fully known without some estimate of the effects of selection on
the study samples. It is possible that selection attenuated the predictive
power of cumulative GPA relative to the other variables.
Further research is planned in which data for the entire pool of
applicants to a given law school will enable a clearer
assessment of the effects of range restriction.
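One common way to estimate what selection does to a validity coefficient is the correction for direct range restriction on the predictor, often called Thorndike's Case II formula. The study does not say which correction it had in mind, so the sketch below is an assumption about method, and the input values are hypothetical.

```python
import math


def correct_for_direct_range_restriction(r_restricted, sd_unrestricted,
                                         sd_restricted):
    """Case II correction for direct selection on the predictor.

    r_restricted is the correlation observed in the selected group; the two
    standard deviations are those of the predictor in the full applicant
    pool and in the selected group.
    """
    u = sd_unrestricted / sd_restricted
    ru = r_restricted * u
    return ru / math.sqrt(1.0 - r_restricted ** 2 + ru ** 2)


# Hypothetical values: validity .30 among enrolled students, with the
# predictor's standard deviation twice as large among all applicants.
corrected = correct_for_direct_range_restriction(0.30, 2.0, 1.0)
print(round(corrected, 3))
```

Applying it requires the predictor's standard deviation in the complete applicant pool, which is why data on all applicants are needed.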

Summary and conclusions. Selected transcript variables were
analyzed along with LSAT, WA, and cumulative college GPA in
an effort to determine whether any of the transcript variables could
effectively increase predictability of FYA. LSAT proved to be
the best single predictor of FYA in the study samples, but two
especially promising transcript variables, social science GPA and
a moderator variable, were identified. Because of the effects of
selection, caution was urged in the interpretation of results. It
is suggested, however, that further research be conducted on the
relationship of some of the more promising transcript variables
to law school performance and that provisions be made in such
research for the collection of data from the complete applicant
pool of the study schools so that range restriction corrections
might be applied. This would enable a more penetrating assessment
of the usefulness of each predictor than was possible in the
present study.
ENGLISH PROFICIENCY, VERBAL APTITUDE, AND
FOREIGN STUDENT SUCCESS IN AMERICAN 
GRADUATE SCHOOLS 


AMIEL T. SHARON 
Educational Testing Service 


The admission of foreign students to graduate study in the
United States is a complex problem. Unlike their American counterparts,
foreign students often lack proficiency in the English
language and have different language and cultural backgrounds.
Furthermore, the undergraduate record, which generally has been
found to be the best predictor of graduate school success, is difficult
to evaluate for the foreign student. The lack of comparability
in the grading systems of universities in different countries makes
it impossible to employ the prediction approach used with American
students. The appraisal of the foreign candidate’s aptitude for
graduate study by standardized admissions tests also has pitfalls.
Poor performance may be due to factors not directly related to
aptitude for graduate study. For example, the nonnative examinee
may lack adequate English proficiency to understand the test
questions or he may not be familiar with the philosophy or
method of American objective tests.

Competence in the English language is one factor which has
been assumed to be crucial for the success of the foreign student
studying at an American university. It is difficult to imagine how
a student can learn in an American graduate school without being
able to read, write, and comprehend in the English language. Thus,
English proficiency might be thought of as a necessary, although
not sufficient, prerequisite for graduate school success. For this
reason many graduate schools recommend or require that their
foreign applicants take the Test of English as a Foreign Language
(TOEFL) in their native country prior to coming to the United
States.

TOEFL is designed to help foreign students demonstrate their
English language proficiency at the advanced level required for
study at American colleges and universities. The test measures
five important language skills: Listening Comprehension, English
Structure, Vocabulary, Reading Comprehension, and Writing
Ability.

A requirement of many graduate schools of all their applicants
is that competence for advanced study be demonstrated by successful
performance on the aptitude test of the Graduate Record
Examination (GRE). The two aptitude tests, Verbal (V) and
Quantitative (Q), are designed to measure mental capabilities
thought to be important in graduate level study. They are not
achievement or proficiency tests which require knowledge in any
specific subject matter. Instead, they attempt to measure reading
comprehension and logical reasoning with both verbal and quantitative
material.

There is no information at the present time that would indicate
how the combined scores on TOEFL and GRE are, or should be,
used by graduate schools for selecting foreign students. Since
TOEFL is labeled a “proficiency” test and the GRE an “aptitude”
test, it is logical to assume that the two tests yield different sorts
of information about a candidate. For that reason, the combination
of the two tests could result in a more accurate prediction of
academic achievement than that afforded by either test alone.

The general purpose of this study was to determine whether
TOEFL adds to the predictive validity of the GRE Verbal test.
More specifically, it was hypothesized that TOEFL would act as
a “moderator” of the relationship between GRE-V and a measure
of graduate school performance in the sense that students scoring
high on TOEFL would be more predictable by GRE-V than those
scoring low. It would seem reasonable to assume that if an individual
does not have adequate English proficiency, a verbal
aptitude test could not accurately predict his scholastic achievement.
The practical implication may be that the GRE-V score of
a foreign applicant with a low TOEFL score should be ignored.

Method. In order to avoid the burdensome testing of students
with the two examinations under consideration, an attempt was
made to obtain the necessary data from the records of graduate
schools which recommend or require their foreign applicants to 
take both TOEFL and the GRE. One hundred forty schools, each 
of which enrolled at least 50 foreign students, were contacted by 
letter. Each school was asked to supply the GRE Aptitude and 
TOEFL scores and graduate school grade-point averages (GPAs) 
for all foreign graduate students presently or previously enrolled 
in the past two years. Information was also sought on the number 
of semesters on which the GPA was based, whether the student 
withdrew from the university, and the major field of each student. 

Of the 140 schools contacted, 24 schools provided usable data
on a total of 975 foreign students. Seventy other schools responded
by either returning unusable data (e.g., GRE or TOEFL scores
missing) or indicating that the requested information was not 
available or not retrievable from their records. 

Results and discussion—Test performance of foreign students. 
The means and standard deviations of the TOEFL and GRE 
scores for the study sample and for selected reference samples 
are indicated in Table 1. The reference sample for TOEFL con- 
sists of 113,975 foreign students seeking admission to institu- 
tions of higher education in the United States who took TOEFL 
from February 1964 through June 1969. The reference sample for 
GRE consists of approximately 539,000 candidates, probably al- 
most all native Americans, who took the Aptitude test from May 
1966 through April 1969. 

The foreign students in the study sample scored, on the average,
over one-half of a standard deviation above the mean of all
foreign applicants on TOEFL, very likely because they consisted
of enrolled students who were selected, at least to some extent,
on the basis of their TOEFL scores. The mean GRE scores of the
sample indicate that there is a great discrepancy, relative to
American students, between their verbal and quantitative abilities
as measured by the GRE. As a group, the subjects are more than
one standard deviation below the mean on GRE-V but more than
one-half of one standard deviation above the mean on GRE-Q.
The superior quantitative scores of the subjects appear to be related
to the fact that half of them were majoring in courses requiring
extensive use of this ability. The mean GRE-Q score of
those majoring in engineering, technology, and mathematics was
670 as compared to the mean of 547 of all the other majors.


TABLE 1

Means and Standard Deviations for the Study Sample and Reference
Applicant Samples on TOEFL and GRE

                      Study Sample      Reference Sample*
Test                  Mean     SD       Mean          SD
TOEFL                 537      65       487           78
GRE-Verbal            348      96       516           129
GRE-Quantitative      609      128      [illegible]   [illegible]

* Foreign applicants for TOEFL; native applicants for GRE.
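The comparisons in standard deviation units can be reproduced from the Table 1 figures for TOEFL and GRE-Verbal. This is a minimal sketch that divides each mean difference by the reference-sample standard deviation, which is assumed to be the intended yardstick. The GRE-Quantitative reference values are not legible in this copy and are left out.

```python
def standardized_difference(sample_mean, reference_mean, reference_sd):
    """Mean difference expressed in reference-sample standard deviations."""
    return (sample_mean - reference_mean) / reference_sd


# Means and standard deviations as printed in Table 1.
toefl_gap = standardized_difference(537, 487, 78)
gre_v_gap = standardized_difference(348, 516, 129)
print(round(toefl_gap, 2), round(gre_v_gap, 2))
```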

A relatively high correlation (.70) was found between GRE-V
and TOEFL indicating that the two tests are, to a large degree,
measuring the same ability or proficiency. However, since the
reliability of GRE-V is .93 and that of TOEFL is .97 the tests
can hardly be taken to be parallel measures of the same linguistic
skills.
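The reasoning about reliabilities can be made concrete with the classical correction for attenuation, which estimates the correlation between two measures if both were perfectly reliable. The source does not report this calculation, so the sketch is offered only as one way of reading the argument.

```python
import math


def disattenuated_correlation(r_xy, reliability_x, reliability_y):
    """Spearman's correction for attenuation."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


# Observed GRE-V/TOEFL correlation and the two reliabilities from the text.
estimate = disattenuated_correlation(0.70, 0.93, 0.97)
print(round(estimate, 3))
```

An estimate well below 1.00 is what one would expect if the two tests measure related but not identical skills.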

Predictive validity. The use of grades as the criterion of graduate
school success made it necessary to consider the problem of
different grading standards at the participating schools. Since a
grade of A in one school might be equivalent to a grade of B
in another school, serious error may be introduced into any prediction
system that did not adjust for these differences. Tucker
(1963) developed a central prediction system useful for pooling
data across a number of schools in order to increase the sample
size for meaningful regression analysis. The central prediction
system can be used to compute one set of regression weights which
apply to all schools. The differential grading problem is solved
by introducing additive and multiplicative constants for adjusting
the predicted grades in each school. These two constants are
determined in part by the variability and average level of a particular
school’s GPA distribution. The central regression weights
are determined in conjunction with the school constants such that
a least squares error function is minimized. The regression weights
and validity for any particular school are determined by information
unique to that school and also determined by information
derived from all the other schools within the system.
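The role of the school constants can be sketched in simplified form. The code below is not Tucker's procedure, which estimates the central weights and the school constants jointly. It assumes the central weights are already known and shows only how an additive and a multiplicative constant could be fitted for each school by least squares. All weights, names and data are hypothetical.

```python
# Hypothetical central weights, assumed already estimated from pooled data.
CENTRAL_WEIGHTS = {"gre_q": 0.004, "toefl": 0.002}


def composite(student):
    """Central weighted composite of a student's predictor scores."""
    return sum(w * student[name] for name, w in CENTRAL_WEIGHTS.items())


def fit_school_constants(students, gpas):
    """Least-squares additive (a) and multiplicative (b) constants for one
    school, so that GPA is approximated by a + b * composite."""
    comps = [composite(s) for s in students]
    n = len(comps)
    mean_c = sum(comps) / n
    mean_g = sum(gpas) / n
    sxx = sum((c - mean_c) ** 2 for c in comps)
    sxy = sum((c - mean_c) * (g - mean_g) for c, g in zip(comps, gpas))
    b = sxy / sxx
    return mean_g - b * mean_c, b


students = [{"gre_q": 500, "toefl": 500},
            {"gre_q": 600, "toefl": 550},
            {"gre_q": 700, "toefl": 600}]
# Made-up GPAs: the same score profiles graded leniently and strictly.
lenient_gpas = [3.10, 3.45, 3.80]   # built as 1.0 + 0.7 * composite
strict_gpas = [2.30, 2.60, 2.90]    # built as 0.5 + 0.6 * composite

a_lenient, b_lenient = fit_school_constants(students, lenient_gpas)
a_strict, b_strict = fit_school_constants(students, strict_gpas)
print(round(a_lenient, 2), round(b_lenient, 2),
      round(a_strict, 2), round(b_strict, 2))
```

The same composite thus maps onto different predicted grades at schools with different grading levels and spreads.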

The initial analysis consisted of combining GRE-V or Q with 
TOEFL-Total in a linear multiple regression through the central
prediction system. Since there was a possibility that different
abilities would be required for success in different fields, the central
prediction analyses were conducted by major field in those fields 
which had a sufficient number of subjects. Table 2 indicates the 
number and percentage of subjects in each major field and their 
test scores and GPAs. The catch-all category “other” consists of 
all students not majoring in engineering, technology, mathematics, 
or natural sciences. 

Table 3 indicates the average validities (weighted by the number
of cases at each school) of the predictors and certain predictor
composites by major field. It can be seen in Table 3 that the best
single overall predictor is GRE-Q with a validity coefficient of
.32 for all subjects. Only in the major field category of “other” is
GRE-Q less valid than TOEFL. TOEFL, however, with a validity
of .39 in the “other” category, is not significantly different from
GRE-V with a validity of .35 (t = 1.76, p > .05). It is also apparent
from Table 3 that the linear combinations of GRE-V or Q
with TOEFL do not result in significantly higher validities over
those obtained when a single best predictor is used alone.
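Averaging validities with weights equal to the number of cases at each school amounts to a weighted mean of the school-level coefficients. In this minimal sketch the school validities and sample sizes are hypothetical, and the coefficients are averaged directly rather than after any transformation, which is an assumption.

```python
def weighted_mean_validity(validities, ns):
    """Mean of school-level validities, weighted by cases per school."""
    return sum(r * n for r, n in zip(validities, ns)) / sum(ns)


# Hypothetical school-level validities and sample sizes.
school_validities = [0.40, 0.30, 0.20]
school_ns = [50, 30, 20]
average_validity = weighted_mean_validity(school_validities, school_ns)
print(round(average_validity, 2))
```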

TABLE 2

Test and GPA Means by Major Field

                                                       GRE
Major Field                  N      %      TOEFL    V      Q      GPA
Engineering, Technology,
  and Mathematics            492    50     544      360    670    3.47
Natural Sciences             176    18     522      320    610    3.32
Other                        307    32     534      343    51?    3.31
All Subjects                 975    100    537      348    609    3.39


TABLE 3

Validities of Predictors and Predictor Composites by Major Field

                                                        TOEFL &    TOEFL &
Major Field                  GRE-V    GRE-Q    TOEFL    GRE-V      GRE-Q
Engineering, Technology,
  and Mathematics            .22      .39      .21      .23        .39
Natural Sciences             .41      .59      .39      .42        .61
Other                        .35      .28      .39      .39        .39
All Subjects                 .24      .32      .26      .27        [illegible]


Further analysis of the data was made to determine whether
TOEFL moderates the relationship between GRE-V and GPA.
Three equally-sized English proficiency groups (low, middle, and
high) based on total TOEFL score were formed within the major
fields of engineering, technology, and mathematics and “other”
(the natural sciences group did not consist of sufficient cases for
analysis). The mean weighted validities of the GRE aptitude tests
within the subgroups are presented in Table 4. The validities for
some of the subgroups are substantially higher than the corresponding
validities for the total group. In the major field of
engineering, technology, and mathematics the validity of GRE-V
is raised from .22 to .35 in the low proficiency group and to .36
in the middle proficiency group. In the same major field the validity
of the GRE-Q is raised from .39 to .56 in the low proficiency
group. In the “other” major field the validity of GRE-V is raised
from .35 to .44 in the middle proficiency group and that of GRE-Q
increased from .28 to .35 in the middle proficiency group and to
.37 in the high proficiency group. Although these results suggest
that TOEFL may be a moderator of the GRE in the prediction of
graduate GPA, the hypothesis that high TOEFL scorers are more
predictable by the GRE is only partially supported by the results
for the “other” group. In the engineering, technology, and
mathematics group the opposite of what was predicted resulted.
The low proficiency group is apparently more predictable by
either aptitude test than the high proficiency group.
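The subgroup analysis can be sketched as a split into thirds on TOEFL followed by a separate correlation within each third. The records below are invented to give a different validity in each group. The study's further steps of working within major fields and weighting across schools are omitted, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def validity_by_toefl_third(records):
    """records: (toefl, gre, gpa) tuples. Returns the GRE-GPA correlation
    within the low, middle and high thirds of the TOEFL distribution."""
    ordered = sorted(records, key=lambda rec: rec[0])
    third = len(ordered) // 3
    groups = {"low": ordered[:third],
              "middle": ordered[third:2 * third],
              "high": ordered[2 * third:]}
    return {name: pearson([rec[1] for rec in grp], [rec[2] for rec in grp])
            for name, grp in groups.items()}


# Made-up records: (TOEFL total, GRE score, graduate GPA).
records = [(400, 300, 3.0), (410, 350, 3.2), (420, 400, 3.4),
           (500, 300, 3.4), (510, 350, 3.2), (520, 400, 3.0),
           (600, 300, 3.0), (610, 350, 3.4), (620, 400, 3.2)]
validities = validity_by_toefl_third(records)
print({name: round(r, 2) for name, r in validities.items()})
```

A moderator effect would show up as systematically different validities across the three groups.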

TABLE 4

GRE Aptitude Test Validities for Low, Middle and High
English Proficiency Groups

                                  Low                  Middle            High
Major Field                  GRE-V   GRE-Q         GRE-V   GRE-Q    GRE-V   GRE-Q
Engineering, Technology,
  and Mathematics            .35     .56           .36     .42      .21     [illegible]
Other                        .30     [illegible]   .44     .35      .38     .37


Perhaps the most noteworthy finding of this study is that, in
general, foreign students appear to succeed in American graduate
schools in spite of scoring more than one standard deviation below
the mean of American students on GRE-V. This finding suggests
that the scores of foreign applicants on this test should be interpreted
cautiously when evaluating their aptitude for graduate study.

Conclusion. The results of this study indicate that an English 
proficiency test such as TOEFL may raise the validity of the 
GRE aptitude tests in predicting foreign students’ graduate school 
GPA. However, the hypothesis that high TOEFL scorers would 
be more predictable by GRE than those less proficient in English 
is only partially supported by the results. Furthermore, it appears 
that foreign students with low English verbal aptitude can succeed 
in American graduate schools. 


REFERENCE 
INTERRELATIONSHIPS AMONG SAT, CLEP, HIGH
SCHOOL AND JUNIOR HIGH SCHOOL ACHIEVE- 
MENT TESTS, AND HIGH SCHOOL AVERAGE 


MAX WEINER AND PATRICIA M. KAY
The City University of New York 


The purpose of this paper was to report intercorrelations among
scores obtained on the College Entrance Examination Board's 
Scholastic Aptitude Tests-Verbal (SAT-V) and Mathematics 
(SAT-M), the College Level Examination for Placement-English 
(CLEP-E) and Mathematics (CLEP-M), high school averages, 
and relatively inexpensive standardized high school and junior 
high school achievement tests. 

During the spring of 1970, The City University of New York 
(CUNY), as part of its open admissions program, administered 
the Stanford High School Reading Test (Form W) and the Stanford
Arithmetic Test: Advanced, Computation Subtest (Form W),
a junior high school test, to approximately 32,000 high school
seniors already admitted to the University. These tests are des- 
ignated in this report as Open Admissions Tests-Reading 
(OAT-R) and Open Admissions Tests-Arithmetic (OAT-A). The 
OAT results were used for budgeting and planning of remedial 
activities as well as for providing information for counseling purposes.
In addition to the OAT's, those students who so desired
took the CLEP-E and/or CLEP-M. A large number of the stu- 
dents had taken the College Board’s SAT's. Both SAT scores and 
high school averages were obtained for this study from admis- 
sions records. 

As a first step toward obtaining evidence concerning OAT va- 
lidity, the present study of the interrelationships among these var- 
iables was undertaken. A corollary question concerned the utility 
of the OAT for students at the upper levels of high school achievement.
A pilot study (Kay, Tittle, and Weiner, 1971) conducted
prior to the OAT testing had indicated that the Stanford high
school level reading test and junior high school level arithmetic
test would be appropriate for the changed population of
freshmen.

Method. Intercorrelation analyses were carried out separately
for two total groups: all students who had taken both SAT and
OAT (N = 17,137), as well as all students who had taken
both CLEP and OAT (N = 3,296). In addition, three separate
analyses were conducted within each total group according to high
school average: 80.0 and greater, 70.0 to 79.9, and 69.9 and
below.

Results. The results of the two analyses for the two primary
[...] averages usually reported. That the correlation between SAT-V
and OAT-R was .80 indicated that a large proportion of variance
was shared by the two tests.

The SAT-M and OAT-A correlation of .69 was lower. In general,
the CLEP-OAT correlations are slightly lower, .75 for CLEP-E
and OAT-R, and .66 for CLEP-M and OAT-A.
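The remark about shared variance can be made concrete with a minimal sketch, assuming the conventional reading that the proportion of variance two tests share is the square of their correlation. The four r values are the total-group correlations quoted above; the function name is illustrative only.

```python
# Proportion of variance shared by two tests, taken as r squared.
# The r values are the total-group correlations quoted in the text;
# `shared_variance` is an illustrative name, not part of the study.

reported_r = {
    ("SAT-V", "OAT-R"): 0.80,
    ("SAT-M", "OAT-A"): 0.69,
    ("CLEP-E", "OAT-R"): 0.75,
    ("CLEP-M", "OAT-A"): 0.66,
}


def shared_variance(r):
    """Return the squared correlation (coefficient of determination)."""
    return r * r


for (test_a, test_b), r in reported_r.items():
    share = shared_variance(r)
    print(f"{test_a} with {test_b}: r = {r:.2f}, r squared = {share:.2f}")
```

On this reading, a correlation of .80 corresponds to roughly two thirds of the variance being common to the two verbal measures.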

TABLE 1
Correlations among OAT, SAT and High School Average for
Total Group of Examinees
N = 17,137

Test Variables      OAT-A     OAT-R     SAT-M     SAT-V H.S. Avg.
OAT-A                1.00       .53       .69       .47       .56
OAT-R                          1.00       .62       .80
SAT-M                                    1.00       .65
SAT-V                                              1.00
H.S. Avg.
Mean                31.43     43.26    468.87    435.25
SD                   8.16     10.31    108.15    105.69

The results of the six intercorrelation analyses in which high
school averages were restricted are presented in Tables 3 through
8. On the whole, the SAT-CLEP-OAT correlations were lower,
as would be expected when the groups are restricted on one
achievement variable. The main purpose of conducting this analysis,
however, was to find out whether the r's would be substantially
similar for all three groups of students (high high school
average, middle range, and low). In point of fact, the resulting
r's indicated slightly stronger relationships between SAT-OAT and
CLEP-OAT for students with high school averages greater than
80 than for the other two groups.
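Why correlations drop when a group is restricted on an achievement variable can be illustrated with a small simulation. This is only a sketch on synthetic data: the common-ability model, the sample size, and all names are assumptions made for the illustration, and nothing here reproduces the CUNY records.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1972)  # fixed seed so the run is repeatable

# Synthetic examinees: one common ability plus independent noise drives
# two test scores and a grade average (hypothetical, not study data).
ability = [rng.gauss(0, 1) for _ in range(20000)]
test_a = [g + rng.gauss(0, 1) for g in ability]
test_b = [g + rng.gauss(0, 1) for g in ability]
grade_avg = [g + rng.gauss(0, 1) for g in ability]

r_full = pearson(test_a, test_b)

# Restrict the group: keep only examinees above the mean grade average.
cutoff = sum(grade_avg) / len(grade_avg)
kept = [i for i, h in enumerate(grade_avg) if h > cutoff]
r_restricted = pearson([test_a[i] for i in kept],
                       [test_b[i] for i in kept])

print(f"full group r = {r_full:.2f}")
print(f"restricted group r = {r_restricted:.2f}")
```

Selecting on the grade average narrows the spread of the common ability within the retained group, so the two test scores covary less there than in the full group.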


TABLE 2
Correlations among OAT, CLEP and High School Average
for Total Group of Examinees
N = 3,296

Test Variables      OAT-A     OAT-R    CLEP-M    CLEP-E H.S. Avg.
OAT-A                1.00       .58       .66       .50       .59
OAT-R                          1.00       .60       .75       .57
CLEP-M                                   1.00       .59       .63
CLEP-E                                             1.00       .56
H.S. Avg.                                                    1.00
Mean                34.55     47.06    511.81    459.08     83.20
SD                   7.18      9.87     93.81     96.50      7.95

TABLE 3
Correlations among OAT and SAT for Examinees with High School
Averages Greater Than or Equal to 80
N = 8,977

Test Variables      OAT-A     OAT-R     SAT-M     SAT-V
OAT-A                1.00                 .60
OAT-R                          1.00       .51
SAT-M                                    1.00       .52
SAT-V                                              1.00
Mean                35.13     47.74     52.01     48.34
SD                   5.89      8.64      9.59      9.69

TABLE 4
Correlations among OAT and SAT for Examinees with High School Averages
Greater Than or Equal to 70 but Less Than 80
N = 6,676

Test Variables      OAT-A     OAT-R     SAT-M     SAT-V
OAT-A                1.00       .39                 .30
OAT-R                          1.00       .50
SAT-M                                    1.00       .53
SAT-V                                              1.00
Mean                30.22     39.44     42.31
SD                             9.52      9.05      8.79


TABLE 5
Correlations among OAT and SAT for Examinees with High School
Averages Less Than 70
N = 1,484

Test Variables      OAT-A     OAT-R     SAT-M     SAT-V
OAT-A                1.00       .36       .54       .26
OAT-R                          1.00       .46       .67
SAT-M                                    1.00
SAT-V                                              1.00


TABLE 6
Correlations among CLEP and OAT for Examinees with High School
Averages Greater Than or Equal to 80
N = 2,296

Test Variables      OAT-A     OAT-R    CLEP-M    CLEP-E
OAT-A                1.00       .36       .54       .82
OAT-R                          1.00       .45       .67
CLEP-M                                   1.00
CLEP-E                                             1.00
Mean                36.95     50.16     54.39     48.71
SD                   4.57      7.75      8.37      8.82


TABLE 7
Correlations among CLEP and OAT for Examinees with High School Averages
Greater Than or Equal to 70 but Less Than 80
N = 821

Test Variables      OAT-A     OAT-R    CLEP-M    CLEP-E
OAT-A                1.00       .46       .61       .88
OAT-R                          1.00       .47
CLEP-M                                   1.00       .51
CLEP-E                                             1.00
Mean                30.51     41.65     44.93     40.71
SD                   8.08      9.92      7.19      8.19


TABLE 8
Correlations among CLEP and OAT for Examinees with High School
Averages Less Than 70
N = 179

Test Variables      OAT-A     OAT-R    CLEP-M    CLEP-E
OAT-A                1.00       .47       .58       .33
OAT-R                          1.00       .48       .64
CLEP-M                                   1.00       .42
CLEP-E                                             1.00
Mean                22.41     32.21     38.74     33.89
SD                   8.50      9.84      4.50      6.45


Summary. This study has presented findings to the effect that
relatively inexpensive standardized tests which measure achievement
in reading and mathematics below the college level may be
used to predict scores on the SAT and CLEP tests. The lack of
substantial differences in r's from one group to another leads one
to believe that the standardized high school and junior high school
tests are equally useful for students at varying levels of high
school achievement. Other validity data concerning the use of
OAT's will continue to be sought.


CROSS-VALIDATION OF A BEHAVIORAL MODEL FOR
PREDICTING SCHOOL SUCCESS¹


DOLORES MUHICH 
Southern Illinois University 


The major objective in this study was the structuring of a
predictive model that would assess combinations of variables 
(including multiplicative and second-degree curvilinear relation- 
ships) that most effectively and parsimoniously measure and fore- 
cast college success. 

According to Kelly, Beggs, McNeil, Eichelberger, and Lyon
(1969) two types of predictor variables helped increase
predictability: (1) generally, those predictor variables which help to
increase the multiple coefficient of determination (R²) have low
correlations with other predictor variables, but high correlations
with the criterion; and (2) the exception is the suppressor
variable, which increases R², but is correlated extremely low (near
zero) with the criterion, but highly correlated with another
predictor which, in turn, is highly correlated with the criterion. They
also called attention to the problem of using higher order
polynomials:

When the reliability of the original rectilinear vector is
moderate (r = .75) or less, the higher order polynomial will
geometrically increase the unreliability (Kelly et al., 1969, p. 191).


McNeil and Spaner (1970) declared that highly correlated
predictor variables can be used when there is a requirement that the
predictor variables account for a certain number of group
membership vectors; or, more importantly, when there is empirical
and theoretical justification for their inclusion. A similar defense
can be made for the inclusion of interaction vectors. In the Muhich
(1970) study, some interactions were moderately correlated with
the criterion while many were zero, or near zero, as the coefficients
in Table 1 reveal. The r between HS GPA and Emotional
Expression was .01.

TABLE 1
Correlation between Predictor Variables and Criterion

                                         Criterion
                             Total College   Current      Achievement
Predictor Variable           GPA             Quarter GPA  Points
High School Grade Point
  Average (HS GPA) x
  Emotional Expression       .20             .04          .19
HS GPA x HS GPA              .58             .30          .28

----
¹ This study is based on a dissertation submitted in partial fulfillment of
the requirements of the PhD in Education degree from Southern Illinois
University, Carbondale. The author is indebted to the Department of Guidance
and Educational Psychology and Research and Projects for financial assistance.

Problem. The major problem in this investigation was to
determine which equation was most valid and simultaneously
relatively parsimonious in predicting college success by structuring a
predictive model that would assess combinations of variables
(including multiplicative and second-degree curvilinear
relationships).

The psychological model proposed for this purpose, which leads
to the generation of propositions and hypotheses for future testing,
is represented by the following equation:


Y_a = f(P, S_f, S_c)

where

Y_a = SELECTED CRITERION MEASURES
    1. Achievement Grade Points for Subject "a"
    2. Current Quarter GPA for Subject "a"
    3. Total College GPA for Subject "a"

P = WITHIN PERSON VARIABLES
    1. Values-Motivation-Goals
        a. Choice of Career
        b. Degree Objective
        c. Educational Major
    2. Problem Solving Ability
        a. Convergent Thinking
            1. ACT Composite and English, Mathematics, Social
               Science, and Natural Science subscores
            2. High School GPA: English, Mathematics, Social
               Studies, and Natural Science Grade Points
        b. Divergent Thinking
            1. Unusual Uses
            2. Common Situations
    3. Personality: Self-Report Inventories
    4. Perception of Classroom Environment
        a. Instructional Methods Preferred
        b. Preference of Class Size
        c. Evaluation of Teaching Objectives and Living-
           Learning Environment
    5. Biographical Information
        a. Sex
        b. Age
        c. Marital Status
        d. Size of High School Graduating Class

S_f = CHARACTERISTICS OF FOCAL STIMULI: Classroom
    tasks and activities (past and present) which lead to arousal
    of stimuli via neural pathways.

S_c = CHARACTERISTICS OF THE CONTEXT: Student
    Time Index


Procedures. All students enrolled in Introductory Educational
Psychology during the 1970 Winter Quarter at Southern Illinois
University (predominantly Juniors) were subjects for this study
(N = 789). A packet containing a Self-Report Inventory (Muhich,
1970), was distributed to students within a one-week interval;
and a comparable deadline date was determined for each section.
Students completed inventories out of class, earned extra points
for returning them by an assigned date; and completion of
inventories was a class requirement.

Those students for whom ACT scores and High School GPA's
were available (N = 426) were randomly divided into a
Prediction Sample (N = 213) and a Validation Sample (N = 213).
Partial regression weights were determined on the Prediction
Sample and cross-validated on the Validation Sample.
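The split-sample procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's analysis: the data are synthetic, only two predictors are used (the study worked with many more), and every name, coefficient, and seed below is hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_two_predictors(x1, x2, y):
    """Least-squares intercept and partial weights for y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def columns(sample):
    """Turn rows of (act, hs_gpa, college_gpa) into three lists."""
    return [list(col) for col in zip(*sample)]


rng = random.Random(1970)  # fixed seed so the run is repeatable

# Hypothetical records for 426 students (not the study's data).
students = []
for _ in range(426):
    act = rng.gauss(20, 4)
    hs_gpa = rng.gauss(3.0, 0.5)
    college_gpa = 0.05 * act + 0.5 * hs_gpa + rng.gauss(0, 0.4)
    students.append((act, hs_gpa, college_gpa))

# Random division into a Prediction Sample and a Validation Sample.
rng.shuffle(students)
prediction, validation = students[:213], students[213:]
act_p, hs_p, gpa_p = columns(prediction)
act_v, hs_v, gpa_v = columns(validation)

# Weights are estimated on the Prediction Sample only.
a, b1, b2 = fit_two_predictors(act_p, hs_p, gpa_p)
fitted_p = [a + b1 * u + b2 * v for u, v in zip(act_p, hs_p)]
predicted_v = [a + b1 * u + b2 * v for u, v in zip(act_v, hs_v)]

r_fit = pearson(gpa_p, fitted_p)    # R in the sample used for fitting
r_cv = pearson(gpa_v, predicted_v)  # R when the weights are carried over
shrinkage = r_fit ** 2 - r_cv ** 2  # change in R squared

print(f"R (prediction sample) = {r_fit:.2f}")
print(f"R (validation sample) = {r_cv:.2f}")
print(f"shrinkage in R squared = {shrinkage:.2f}")
```

With few predictors and samples of this size the shrinkage is usually small and can even be negative by chance; it tends to grow as more predictors are fitted to the same number of cases.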

The procedure began by including all of the relevant predictor
[...] significance. The extent of overfitting of the models chosen
was examined by applying the empirically derived partial regression
weights to the second random sample, the Validation Sample. The
final [...] yielded the relatively highest average Fisher z value and
gave simultaneously most parsimonious selections of variables.
[...] multiple linear regression technique (Kelly et al., 1969).

Results. The full model with 92 predictor variables produced the
highest average R (.79) and the least amount of R² shrinkage (.05)
in the criterion Total College GPA. The two most parsimonious
as well as predictive models contained two predictors: (1) ACT
Total and High School GPA: average R = .58 and a .09 shrinkage
in R², and (2) High School GPA Squared and High School
GPA x Emotional Expression: average R = .57 and an [...]
shrinkage in R².
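The procedure refers to an average Fisher z value, and the results report an average R for each model. A minimal sketch of such averaging follows, assuming (the text does not spell this out) that each model's R values from the two samples were transformed to z, averaged, and transformed back. The two R values in the example are hypothetical.

```python
import math


def average_r(values):
    """Average correlations through Fisher's z transformation."""
    z_values = [math.atanh(r) for r in values]
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)


# Hypothetical R values from a prediction and a validation sample.
example = [0.62, 0.54]
print(f"average R = {average_r(example):.2f}")
```

Averaging in the z metric rather than averaging the R values directly gives slightly more weight to the larger coefficient, because z grows faster than r near the upper end of the scale.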

In all instances, Total College GPA was the most predictive
criterion, producing the highest R² values.


THE PREDICTIVE VALIDITY OF THE AMERICAN
COLLEGE TEST FOR STUDENTS FROM
LOW SOCIOECONOMIC LEVELS¹


RAY MERRITT 


Delta State College 
Cleveland, Mississippi 


The predictive validity of the American College Test (ACT)
has been established by the American College Testing Program 
(1965) and by individual researchers (Hoyt and Munday, 1969; 
Munday, 1967). Though these investigations have indicated ACT 
scores can be used to foretell college grades of students from the 
general population, few efforts have been made to assess the pre- 
dictive validity of the ACT for persons from specific socioeconomic 
backgrounds. Since students from low income families on the 
average do make lower scores on the ACT than do persons from 
other socioeconomic backgrounds (Merritt, 1970), the ACT may 
be less valid as a predictor of their college academic success. The 
purpose of this study, therefore, was to determine predictive va- 
lidity of the ACT for students from low socioeconomic levels. 

Method. The sample for this study consisted of individuals
employed on the Federal Work-Study Program during the school
sessions of 1968-69 and 1970-71. All members of the sample
earned a minimum of $50 during the fall semester by working
on this labor program, had a composite ACT score on record with
the college, and lived in either dormitory facilities or apartments
for married students. The Work-Study Program eligibility
requirements assured that all were from low income backgrounds.

----
¹ Study funded by the Office of Research of Delta State College.

A regression line between fall semester GPA and composite ACT
score was computed for the 204 persons enrolled during 1968-69.
The values of this line were then used to develop a projected GPA
for each of the 347 members of the 1970-71 group. A correlation
coefficient r between the predicted and earned GPA's of the
1970-71 students was computed and tested for significance.
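The two-step procedure (fit a line on one cohort, project GPA for the next, correlate projected with earned GPA) can be sketched as below. The cohorts are synthetic stand-ins with the same sizes as in the study; the generating coefficients, seed, and function names are assumptions made for the illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(1968)  # fixed seed so the run is repeatable


def make_cohort(n):
    """Hypothetical (ACT, GPA) scores; not the Delta State records."""
    act = [rng.gauss(19, 4) for _ in range(n)]
    gpa = [1.0 + 0.08 * a + rng.gauss(0, 0.5) for a in act]
    return act, gpa


act_base, gpa_base = make_cohort(204)  # stands in for the 1968-69 group
act_new, gpa_new = make_cohort(347)    # stands in for the 1970-71 group

# Step 1: regression line from the earlier cohort.
slope, intercept = fit_line(act_base, gpa_base)

# Step 2: projected GPA for every member of the later cohort.
projected = [intercept + slope * a for a in act_new]

# Step 3: correlate projected with earned GPA in the later cohort.
r_projected = pearson(projected, gpa_new)
r_act = pearson(act_new, gpa_new)

print(f"r(projected GPA, earned GPA) = {r_projected:.2f}")
print(f"r(ACT, earned GPA)           = {r_act:.2f}")
```

Because the projected GPA is a linear function of the ACT score, its correlation with earned GPA equals the ACT-GPA correlation in the later cohort whenever the fitted slope is positive. What carrying the line over adds is a check on the level and scale of the projections, which the correlation alone does not capture.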
Findings. The data indicated that these two groups had identical
average ACT scores (19.0), and the 1970-71 students earned a
2.51 mean GPA as compared to a 2.50 for the 1968-69 group.
Both GPA's were computed on the 4.0 basis. The correlation
coefficient between predicted and earned GPA for the 1970-71
students was .71 and significant beyond the .001 level.
Further analysis of the data revealed a lower predictive validity
coefficient for boys (.32) than for girls (.50). Both r's, however,
were significant beyond the .001 level. These findings are reported
in Table 1.
Discussion. The results of this investigation indicate that the
ACT can be used to predict the academic performance of college
students from low socioeconomic backgrounds who are employed
on the Federal Work-Study Program. The .71 correlation coefficient,
which represents a relationship large enough to provide
relatively accurate prediction, should be interpreted cautiously,
however, because of the likelihood that the samples of males and
females came from different populations. Combining the two
scatterplots for the two sexes results in an elongated scatter diagram
reflecting a high degree of correlation but a misleading basis for
prediction in view of the sex differences present. Although the
relationships of .50 for females and .32 for males were, respectively,
moderate and low, they were statistically significant and
somewhat promising for differential prediction of college achievement.
These findings tend to support the hypothesis that the
composite score of the ACT is a valid predictor of college grades for
students from a low socioeconomic background.
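How pooling two groups can lengthen the scatter and raise the overall correlation is shown in the sketch below. It uses synthetic scores only; the gap between the group means is deliberately much larger than the modest differences in Table 1, so the output illustrates the mechanism and is not a reanalysis of the study.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)  # fixed seed so the run is repeatable


def make_group(n, mean_x, mean_y):
    """Hypothetical scores with a weak within-group relation."""
    xs = [rng.gauss(mean_x, 1) for _ in range(n)]
    ys = [mean_y + 0.3 * (x - mean_x) + rng.gauss(0, 1) for x in xs]
    return xs, ys


# Two groups with the same weak within-group slope but different means
# on both the predictor and the criterion (exaggerated on purpose).
x_a, y_a = make_group(1000, 0.0, 0.0)
x_b, y_b = make_group(1000, 3.0, 3.0)

r_a = pearson(x_a, y_a)
r_b = pearson(x_b, y_b)
r_pooled = pearson(x_a + x_b, y_a + y_b)

print(f"group A r = {r_a:.2f}")
print(f"group B r = {r_b:.2f}")
print(f"pooled r  = {r_pooled:.2f}")
```

The pooled coefficient exceeds both within-group coefficients because the difference between the group means adds covariance that neither group shows on its own.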


TABLE 1
Predictive Validity Coefficients between ACT and GPA for Students
from Low Socioeconomic Levels

                         Mean      Mean
                  N      ACT       GPA       r        p
Total group      347     19.0      2.51     .71     <.001
Females          194     19.3      2.70     .50     <.001
Males            153     18.6      2.29     .32     <.001


THE RELATIONSHIP OF STRUCTURE-OF-INTELLECT
FACTOR ABILITIES TO PERFORMANCE IN HIGH
SCHOOL MODERN ALGEBRA


KEITH A. HOLLY 
Duarte Unified School District 
WILLIAM B. MICHAEL 
University of Southern California 


Purposes. Based on the use of factor tests that represent an 
operational translation of the constructs of Guilford’s (1967) 
Structure-of-Intellect (SI) model, the major objectives of this 
study were to ascertain through employing a sample of 177 
secondary school students in a middle-class suburban community
in Southern California comparative validities of the following 
combinations of predictors relative to each of two criterion mea- 
sures in a modern algebra course: (1) SI factor tests alone, (2) com- 
mercially published achievement and aptitude measures alone 
(COMM tests), (3) achievement level in eighth grade mathematics 
courses (MATH) in terms of grade point average (GPA) and SI 
factor tests, and (4) COMM tests and MATH. This investiga- 
tion closely parallels one involving prediction of success in tenth- 
grade geometry by Caldwell, Schrader, Michael, and Meyers 
(1970). 

Methodology. Stepwise multiple-regression analyses along with 
double cross-validation procedures employing even- and odd- 
numbered students were used in the prediction of each of two 
criterion measures—GPA in modern algebra course and perfor- 
mance on the Cooperative Mathematics Tests, Algebra (CMT)— 
from optimally weighted composites of predictor variables derived 
from 15 SI factor tests, nine scales of the California Achievement
Tests, three of the California Test of Mental Maturity, and
MATH.
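Double cross-validation with even- and odd-numbered students can be sketched as follows. This is a simplified illustration under stated assumptions: two fixed predictors stand in for the stepwise-selected composites, the scores are synthetic, and all names and coefficients are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_two_predictors(x1, x2, y):
    """Least-squares intercept and partial weights for y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2


rng = random.Random(177)  # fixed seed so the run is repeatable

# Hypothetical scores for 177 students: two predictors, one criterion.
x1 = [rng.gauss(50, 10) for _ in range(177)]
x2 = [rng.gauss(50, 10) for _ in range(177)]
y = [0.03 * u + 0.02 * v + rng.gauss(0, 0.4) for u, v in zip(x1, x2)]

# Students are numbered from 1; list positions start at 0.
odd = [i for i in range(177) if (i + 1) % 2 == 1]
even = [i for i in range(177) if (i + 1) % 2 == 0]


def take(values, positions):
    """Select the entries of `values` at the given positions."""
    return [values[i] for i in positions]


def cross_validate(fit_half, other_half):
    """Fit on one half; return R there and R in the other half."""
    a, b1, b2 = fit_two_predictors(
        take(x1, fit_half), take(x2, fit_half), take(y, fit_half))

    def composite(half):
        return [a + b1 * x1[i] + b2 * x2[i] for i in half]

    r_own = pearson(take(y, fit_half), composite(fit_half))
    r_other = pearson(take(y, other_half), composite(other_half))
    return r_own, r_other


# Each half serves once for fitting and once for checking.
even_fit, even_cv = cross_validate(even, odd)
odd_fit, odd_cv = cross_validate(odd, even)

print(f"weights from even half: R = {even_fit:.2f}, carried over = {even_cv:.2f}")
print(f"weights from odd half:  R = {odd_fit:.2f}, carried over = {odd_cv:.2f}")
```

Running the procedure in both directions yields two cross-validated coefficients for every composite, which guards against a favorable result that depends on one particular split.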

From an independent pilot group of 34 students, reactions were
sought and evaluated for 20 SI factor tests. In addition, efforts
were made to match objectives in the modern algebra curriculum
developed by participating teachers with psychological processes
represented in the SI factor tests.

This pilot endeavor resulted in the selection of 15 SI factor
tests. Psychological processes of evaluation, memory, cognition,
divergent production, and convergent production were hypothesized
to be contained in five, two, four, two, and two measures,
respectively. One test contained semantic content, and fourteen tests
symbolic content. Products of classes, relations, systems,
transformations, and implications were judged to be reflected in five,
three, two, two, and three tests, respectively.

Findings. Although the validity coefficients of the two initially
chosen SI factor-test composites were .60 and .56 relative to CMT
and GPA criteria, respectively, the corresponding validities of five
COMM composites varied between .38 and .58 (CMT criterion)
and of five COMM composites (three being same ones as for CMT
criterion) between .25 and .44 (GPA criterion). For the same two
criteria, respectively, the validity coefficients of the SI factor-test
and MATH composites were .59 and .60. Ranges in validity
coefficients within the COMM-MATH composites for the CMT
and GPA criteria were, respectively, between .46 and .54 (four
composites) and between .44 and .51 (16 composites). The
administration time of the SI factor tests was approximately one-half to
only one-third that of the COMM composites.

For the following four optimally weighted composites of (1)
four SI factor tests against a GPA criterion, (2) four SI factor
tests relative to a CMT criterion, (3) three SI factor tests and
the MATH predictor with a GPA criterion, and (4) two SI
factor tests and the MATH predictor in relation to the CMT
criterion, the multiple correlation coefficients (R's) prior to and
subsequent to cross validation for even-numbered students were,
respectively, .57 and .43, .64 and .36, .58 and .32, and .64 and .50;
and for odd-numbered students, respectively, .61 and .35, .52 and
.47, .64 and .47, and .53 and .45. After double cross-validation
the reduced multiple R's for composites involving SI factor tests
either with or without the MATH variable were still comparable
to those found for optimal combinations of COMM tests that were
not cross-validated.
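The size of the decrements can be tabulated directly from the R's quoted above. The sketch below only subtracts the reported values; it assumes the pairs are read in the order given in the text and that decimal points dropped in reproduction belong where the pattern of the other values indicates.

```python
# (R before cross-validation, R after) for composites (1) to (4),
# in the order the text lists them. Decimal points missing from the
# reproduced text are restored here, which is an assumption.
even_numbered = [(0.57, 0.43), (0.64, 0.36), (0.58, 0.32), (0.64, 0.50)]
odd_numbered = [(0.61, 0.35), (0.52, 0.47), (0.64, 0.47), (0.53, 0.45)]


def decrements(pairs):
    """Drop in R from the fitting half to the cross-validation half."""
    return [round(before - after, 2) for before, after in pairs]


print("even-numbered students:", decrements(even_numbered))
print("odd-numbered students: ", decrements(odd_numbered))
```

Laid out this way, the decrements range from about .05 to .28 across the eight comparisons.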

Conclusions. The following five conclusions resulted: (1)
Optimally chosen combinations of SI factor tests were substantially
related to performance in a secondary school modern algebra
course regardless of whether CMT attainment or GPA was
employed as the criterion (although modest decrements in validity
coefficients occurred in double cross-validation procedures). (2)
When the MATH variable was added to the previously obtained
optimal composite of SI factor tests it contributed little if any
additional valid variance to prediction of either criterion measure,
although the MATH variable by itself was a moderately valid
predictor. (3) Although the SI factor measures of the psychological
process of evaluation were significantly correlated with the two
criterion variables, SI factor tests representing divergent and
convergent production were relatively more valid in the prediction of the
GPA criterion, whereas those tests hypothesized to reflect
operations of cognition and memory were relatively more valid for the
prediction of CMT performance. (4) Irrespective of which criterion
of success in modern algebra was chosen an optimal composite of
SI factor tests based on stepwise regression analyses was not only
consistently more valid (in 30 comparisons out of 30) than
similarly weighted combinations of COMM tests but also much less
time-consuming in administration. (5) Relative to either success
criterion, an optimal composite of SI factor measures and the
MATH variable was without exception not only more predictive
but also with few exceptions less time-consuming than were
optimally weighted composites of COMM tests and the MATH
measure.

Recommendations. On the basis of the findings it would appear
that mathematics instructors should consider seriously the use of
SI factor tests for selection and placement of students, although
only five to eight such tests probably need to be used to duplicate
the major psychological operations involved in the curricula.
Efforts need to be expended to minimize the time and effort
required for the scoring of SI factor tests so that the time-saving
advantages afforded by their administration may be maintained.


THE CALIFORNIA COMPREHENSIVE TEST OF BASIC
SKILLS: A PREDICTOR OF SUCCESS FOR
HIGH SCHOOL FRESHMEN


JAMES S. NOLAN AND JAMES JACOBSON


St. Peter's College, Jersey City and North Bergen Board of Education, 
New Jersey 


The purpose of this study was to summarize data on the validity
of the scores on the subtests and on the total battery score of
the California Comprehensive Test of Basic Skills (CTBS) as
well as on the total IQ Scale of the California Short-Form Test of
Mental Maturity (CTMM) in relation to academic success in
two specific subjects, English and Mathematics at the ninth grade
level. The areas selected were those most frequently referred to
by researchers in their efforts to establish norms for measuring
success throughout the four years of high school. The study was
designed to reveal correlations between scores on achievement
tests administered at the eighth grade level and course grades of
100 students at the end of the ninth grade who were randomly
selected from a total high school class of 512 students.

Procedure. As indicated in Table 1, validity coefficients
(coefficients of correlation) were found between each of the two
criterion measures and each of five selected predictor variables and
tested for statistical significance.

Findings and discussion. Although the validity coefficients for
the Reading score (.55) and Mathematics score (.52) were
significant in relation to the criterion of success in ninth grade
English, the Language score (.60) was the most valid predictor of
the three subtests. The Total Battery score as a predictor of
success in English provided the highest correlation (.63). The
Intelligence Quotient exhibited limited value as a predictor for ninth
grade English (.46) as well as for ninth grade mathematics (.34).
The Total Battery and the Language score provided moderate
validities as predictors for ninth grade mathematics (.46 and [...]).
The predictive validity of the Mathematics score for ninth grade
mathematics (.54) and for ninth grade English (.52) indicated
general usefulness. The predictor of least validity for ninth grade
mathematics was Reading (.28).

TABLE 1
Validity Coefficients between Each of Two Criterion Variables
and Selected Predictor Variables*

                              Criterion Variables
Predictor Variables           English     Mathematics
Reading                         .55           .28
Language                        .60
Mathematics                     .52           .54
Total Battery                   .63           .46
Intelligence Quotient           .46           .34

* All correlation coefficients significant beyond the .01 level except
the coefficient of .28 which was significant at the .05 level.

In general, achievement test scores appeared to be more valid
predictors of grades in English and Mathematics courses than were
scores on a scholastic aptitude or general intelligence test.
Replications of this study with other samples could be expected to
result in slight changes in the size and pattern of the validity
coefficients.


THE FACTORIAL VALIDITY OF A RATING SCALE FOR
THE EVALUATION OF RESEARCH ARTICLES


CARLETON B. SHAY AND WAYNE S. ZIMMERMAN
California State College, Los Angeles
WILLIAM B. MICHAEL
University of Southern California


The purpose of the investigation was to identify the factorial
dimensions of 25 characteristics on a scale for rating research
articles. The scale was devised by the Committee on Evaluation
of Research of the American Educational Research Association
under the chairmanship of Edwin Wandt (Wandt, Adams, Collett,
Michael, Ryans, and Shay, 1967; Adams, Collett, Michael, Ryans,
Shay, and Wandt, 1967). As cited in Table 1, each of the 25
characteristics was hypothesized as being applicable to the large
majority of educational research articles, regardless of the
methodology employed, although, admittedly, for a given article some
of the characteristics might not apply. Efforts to determine
factorial validity were judged worthwhile as a means not only for
improving interpretability of the instrument but also for reducing
the 25 interdependent criteria to a minimum and more manageable
number. It was thought that in the evaluation of research articles
the new dimensions identified could be useful to editors and other
critical reviewers as well as to professors and students in
research methodology classes.

Methodology. A proportionately stratified sample of 125
educational research articles from 39 different journals was drawn
from a population of 827 research articles published in these
journals in 1962. Each article was evaluated by an expert in
educational research who was nominated by the committee. Forty-one
of the articles were evaluated by a second similarly chosen
expert to obtain pairs of ratings for use as a reliability sample.
(The reliability estimates were considered high enough to permit
ratings by a single judge to be used.) Each expert rated his article
on the 25 characteristics.

TABLE 1
Rotated Factor Loadings Above .40 Derived from Principal Components
Solution of Intercorrelations of Judges' Ratings on Each of 25
Characteristics of 125 Research Articles

Characteristics (Scales)* on the Evaluation Form

 1. Problem is clearly stated
 2. Hypotheses are clearly stated
 3. Problem is significant
 4. Assumptions are clearly stated
 5. Limitations of the study are stated
 6. Important terms are defined
 7. Relationship of the problem to previous research is made clear
 8. Research design is described fully
 9. Research design is appropriate for the solution of the problem
10. Research design is free of specific weaknesses
11. Population and sample are described
12. Method of sampling is appropriate
13. Data-gathering methods or procedures are described
14. Data-gathering methods or procedures are appropriate to the
    solution of the problem
15. Data-gathering methods or procedures are utilized correctly
16. Validity and reliability of the evidence gathered are established
17. Appropriate methods are selected to analyze the data
18. Methods utilized in analyzing the data are applied correctly
19. Results of the analysis are presented clearly   .57
20. Conclusions are clearly stated   .43
21. Conclusions are substantiated by the evidence presented   .42
22. Generalizations are confined to the population from which the
    sample was drawn
23. Report is clearly written
24. Report is logically organized
25. Tone of the report displays an unbiased, impartial scientific
    attitude

* Factors: I—Method of Analysis; II—Design; III—Sampling; IV—Rigor;
V—Significance; VI—Hypothesis; VII—Exposition; and VIII—Objectivity.

A product-moment correlation matrix was calculated for the
ratings assigned by the 125 judges to the 25 characteristics.¹
Principal components were extracted using the squared multiple
R for each characteristic with the other characteristics as the
estimate of its communality. All 15 components extracted were
rotated by Kaiser's Normal Varimax Method (Kaiser, 1960).
Only eight rotated factors yielded two or more characteristics
with loadings of at least .40. The eight factors
and their loadings on the 25 characteristics are cited in Table 1.
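The retention rule stated above (keep a rotated factor only if two or more characteristics load at least .40 on it) can be sketched as below. The loading matrix is hypothetical and much smaller than the study's 25 by 15 matrix, and taking the absolute value of a loading is an assumption, since the text does not say how negative loadings were treated.

```python
# Hypothetical rotated loadings: one row per scale, one column per factor.
loadings = {
    "scale 1": [0.62, 0.10, 0.05],
    "scale 2": [0.48, 0.33, 0.12],
    "scale 3": [0.15, 0.71, 0.08],
    "scale 4": [0.09, 0.44, 0.41],
    "scale 5": [0.20, 0.18, 0.39],
}

THRESHOLD = 0.40    # minimum loading counted as salient
MIN_SCALES = 2      # salient scales needed to retain a factor


def salient_scales(table, factor, threshold=THRESHOLD):
    """Scales whose absolute loading on `factor` meets the threshold."""
    return [name for name, row in table.items()
            if abs(row[factor]) >= threshold]


n_factors = len(next(iter(loadings.values())))
retained = [factor for factor in range(n_factors)
            if len(salient_scales(loadings, factor)) >= MIN_SCALES]

for factor in range(n_factors):
    names = salient_scales(loadings, factor)
    status = "retained" if factor in retained else "dropped"
    print(f"factor {factor + 1}: {names} -> {status}")
```

In this toy matrix the third factor is dropped because only one scale reaches the threshold, which mirrors how 15 rotated components were reduced to eight interpretable factors.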

Findings. From the data presented in Table 1 the factors were 
tentatively interpreted as follows: 

Factor I—Method of analysis. The highest loadings on Factor
I are for the three scales specifically intended to evaluate
appropriateness and correctness of the methods of the analysis of data
(18, 17, and 19).

Factor II—Design. The two scales with the highest loadings (9
and 10) were specifically prepared to evaluate research design;
Scale 14, with the third highest loading, is also related to research
design. Scale 8 (Research design is described fully), which might
have been expected to appear in this factor, has a loading of only
.25. The high loading of variable 14 is understandable in that
“appropriate” data gathering methods suggest good design. The
low loading of variable 8 suggests that the raters saw full
description of the design as something relatively independent of
good design.

Factor III—Sampling. The two scales with the highest loadings
(12 and 11) were specifically intended to evaluate sampling
procedures; and Scale 22, with the third highest loading, is also
related to sampling. Although the three other scales do not relate
specifically to sampling, they indirectly reflect it to the degree
indicated by the loadings.

Factor IV—Rigor. This factor includes only two highly loaded
scales (5 and 4), both of which were designed to evaluate logical
and theoretical orientation. Scales 6 and 7, which were also intended
to reflect the adequacy of such orientation, have loadings of only
.38 and .33, respectively. An emphasis on scientific rigor seems
to be reflected in Scales 4 and 5, but not in Scales 6 and 7.

----
¹ Computations were performed through the courtesy of Western Data
Processing Center, Graduate School of Business Administration, University
of California, Los Angeles.

Factor V—Significance. The scale with the highest loading (3)
refers to the significance of the problem. Scales 1, 6 and perhaps
24 to some extent appear to involve characteristics which are
prerequisite to determining significance.

Factor VI—Hypothesis. The factor includes only two scales (2
and 1), both clearly related to the explicit statement of the
hypothesis and the problem.

Factor VII—Exposition. This factor involves only two scales
(23 and 24), both of which are clearly related to exposition.

Factor VIII—Objectivity. The scale with the highest loading (25)
was specifically included to evaluate the objectivity of the report.
The scale with the second highest loading (21) involves objectivity
in formulating conclusions, while Scale 20 is a necessary
prerequisite to determining the objectivity of the conclusions. Scale
22, which was originally expected to cluster with Scales 20 and
21, has a loading of only .14, demonstrating that the raters
interpreted this scale as being related not to objectivity of conclusions,
but to concern with sampling.

A comparison of the factors derived from the factor analysis
with the a priori categorization made by the Committee reveals
one major difference; i.e., "data gathering" did not emerge as a
factor although three scales (13, 14, and 15) had been specifically
designed to evaluate this aspect of the research articles. It would
appear that the judges did not view data gathering as a separate
entity but subsumed such procedures under Research Design and,
to a lesser degree, under Analysis of Data and Sampling.

Summary and conclusion. Eight interpretable factors described
the dimensions among the intercorrelations of 25 rated
characteristics for a representative sample of 125 research articles in 39
different education journals published during 1962. These eight
constructs would appear to constitute a parsimonious, meaningful,
and viable set of concepts for the evaluation of research
articles.


CORRELATES OF THE WECHSLER ADULT INTELLIGENCE
SCALE, THE SLOSSON INTELLIGENCE TEST,
ACT SCORES AND GRADE POINT AVERAGES


JOHN D. MARTIN AND LINDA RUDOLPH 
Austin Peay State University 


THE Slosson Intelligence Test (SIT) was introduced in 1963 by
Richard L. Slosson for use as a quick “individual” intelligence
test. Slosson’s purpose in constructing this test was to provide 
an abbreviated form of the Stanford-Binet Intelligence Scale 
(S-B IS), Form L-M, which could be quickly and easily ad- 
ministered. The SIT requires from 10 to 30 minutes to administer 
whereas widely used individual intelligence tests such as the 
Stanford-Binet and Wechsler Adult Intelligence Scale (WAIS) re- 
quire a minimum of one hour for administration and scoring. The 
SIT requires no special testing materials other than the manual and 
score sheet.

The SIT appears to be a highly reliable and valid test for 
measuring intelligence. Slosson (1963) conducted a study with 
139 subjects ranging from four to 50 years of age and obtained 
a reliability coefficient of .97 in a test-retest interval of two
months. Comparison with the S-B IS on 141 subjects yielded a
concurrent validity coefficient of .92.

Purpose. The writers noted while reviewing the literature that 
most reliability and validity studies of the SIT have been done
with children or with an atypical population. Slosson indicated
that this test is a valid and reliable instrument for measuring
adult intelligence as well as the intelligence of children. One of
the purposes of this study was to determine the validity of the
SIT on an adult population when compared with an older and
proven test of mental ability, the WAIS.

A second purpose of the study was to determine the degree of 
relationship between the SIT and the American College Testing 
Program examination (ACT), and between the SIT and grade 
point averages (GPAs) in order to ascertain whether the SIT can 
be used by personnel working with prospective college students to 
predict acceptance and success in college. 

Method. The sample used in this study was undergraduate stu- 
dents enrolled in lower division psychology courses at Austin 
Peay State University, Clarksville, Tennessee. The sample was 
composed of 50 students, ranging in age from 18 to 39 years. Be- 
cause ACT scores were not available for transfer students in- 
cluded in the sample, the number of subjects for the correlation 
between ACT scores and the SIT was reduced to 41. 

The WAIS was selected as the criterion with which to compare 
the SIT because of its proven and established reputation as a valid 
and reliable test of mental ability. 

The ACT was chosen as the instrument to serve as the criterion 
for predicting achievement in college because of its widespread 
acceptance and use as a measure of scholastic aptitude for entering
students.

SIT scores were correlated with GPAs of the subjects to as- 
certain the value of the SIT in predicting success in college. 

The WAIS and SIT were administered individually to each stu- 
dent over a period of four months. 

Results. The Pearson product-moment technique was used to 
compute the correlation coefficients. WAIS IQ scores were com- 
pared with SIT IQ's, ACT scores, and GPAs. Table 1 summarizes 
the correlations. Means and standard deviations are given in 
Table 2. 

The difference between IQ scores obtained on the WAIS full 
scale and the SIT ranged from one to 23 points, with the average 
difference between scores being 4.4. 

Discussion. Although significant beyond the .01 level, the va- 
lidity coefficient of .70 between the WAIS and SIT obtained in 
this study was slightly lower than the coefficients reported in the 
review of the literature between the S-B IS and WAIS. Since the
SIT is composed of items from the S-B IS, it is plausible that
the SIT would correlate higher with the S-B IS than with the
WAIS.


TABLE 1
Correlations between the SIT, WAIS, ACT scores, and GPAs

Item                                      r*
1. WAIS Full Scale and SIT               .70
2. WAIS Verbal Scale and SIT             .73
3. WAIS Performance Scale and SIT        .49
4. ACT Scores and SIT                    .56
5. GPAs and SIT                          .30
6. GPAs and ACT                          .465

* N = 50 for all correlations except Items 4 and 6, where N = 41. All correlations were
significant beyond the .01 level with the exception of the r of .30 obtained between GPAs and
SIT. This correlation was significant at the .05 level.

In her discussion of validity coefficients, Anastasi (1961) stated 
that “the wider the range of scores the higher the correlation.” 
Since the range of scores in this sample is restricted to IQ’s from
90 to 146, the lower end of a standardization sample was omitted. 
The effect of this selection of a population would be, therefore, 
to lower the validity coefficient. 
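
The size of this effect can be sketched with the standard correction for direct restriction of range (often called Thorndike's Case 2). This correction was not applied by the writers; it is shown here only as an illustration. The sample WAIS Full Scale SD of 8.60 is taken from Table 2, and the population SD of 15 for WAIS IQs is an assumption made for the example.

```python
import math


def correct_for_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    """Estimate the correlation in the unrestricted group.

    Uses the usual formula for direct restriction on one variable:
    r_c = r * k / sqrt(1 - r^2 + r^2 * k^2), where k = SD_unrestricted / SD_restricted.
    """
    k = sd_unrestricted / sd_restricted
    r = r_restricted
    return (r * k) / math.sqrt(1.0 - r * r + r * r * k * k)


# Observed WAIS Full Scale and SIT correlation, and the sample WAIS SD.
observed_r = 0.70
sample_sd = 8.60
# Assumption for illustration: SD of WAIS IQs in an unselected population.
population_sd = 15.0

estimated_r = correct_for_range_restriction(observed_r, sample_sd, population_sd)
print(round(estimated_r, 2))
```

Under these assumptions the estimated coefficient rises well above the observed .70, which is the direction of the effect described above. The figure is only as good as the assumed population SD and the assumption that selection operated directly on WAIS scores.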

As could be expected, it was found that the WAIS verbal sec- 
tion correlated higher with the SIT (.73) than the full scale 
WAIS correlated with the SIT (.70). Most studies show higher 
correlations between the verbal portion and other academic indices 
than between the full scale WAIS scores and academic indices. 
The correlation of .49 between the WAIS performance section and 
the SIT was also expected, as most studies have indicated that 
the performance section does not predict academic success so well 
as a verbal portion and does not correlate so highly with other 
verbal intelligence tests. 

The correlation of .56 between ACT scores and the SIT was
comparable to or slightly higher than the coefficients of .48 to .55
reported by Anderson (1942) between the American Council on
Education Psychological Examination (ACE) and the
Wechsler-Bellevue (W-B) full scale scores.


TABLE 2

                                  N       Mean       SD

1. WAIS Full Scale               50     120.24     8.60
2. WAIS Verbal Scale             50     123.02     7.63
3. WAIS Performance Scale        50     113.82    11.95
4. SIT                           50     124.72    10.32
5. ACT Scores                    41      20.37     4.08

Although significant at the .05 level, the coefficient of .30
between GPAs and SIT scores was slightly lower than those
coefficients reported in the literature for ACE scores and GPAs and
lower than the coefficient of .465 obtained in this study between
ACT scores and GPAs.
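
The significance statement for the coefficient of .30 can be checked with the usual t test for a correlation coefficient. The sketch below is an illustration rather than the writers' own computation; the critical values are assumed from standard two-tailed t tables for 48 degrees of freedom, rounded to two decimals.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a zero population correlation (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# Assumed tabled two-tailed critical values of t for df = 48.
T_CRIT_05 = 2.01
T_CRIT_01 = 2.68

# GPAs and SIT: r = .30 with N = 50.
t_gpa_sit = t_for_correlation(0.30, 50)
print(round(t_gpa_sit, 2))
print(T_CRIT_05 < t_gpa_sit < T_CRIT_01)
```

The statistic falls between the two critical values, which agrees with a coefficient that is significant at the .05 level but not at the .01 level.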

Average testing time for the SIT was about 30 minutes. Slosson 
(1963) stated that the time required to give the test is from
10 to 15 minutes for the average person. The writers believe 
that the lengthened time for administration of the SIT in this 
study was probably due to the necessity for subjects to pass “10 
in a row” to establish a basal age. It was necessary most 
of the time in testing subjects to go back to an earlier age level 
than the 15-0 level suggested to obtain a basal age. 

Conclusion. Although validity coefficients between the SIT and
WAIS verbal and full scale scores were not so high as expected,
they were of sufficient magnitude to be significant at or beyond
the .01 level and thereby substantiated Slosson's contention that
the SIT is a valid test of mental ability for adults.

The SIT correlates highly enough with ACT scores to be considered
a valid instrument for predicting acceptance and success
in college.


THE VALIDITY OF AN ABBREVIATED FORM OF THE
STANFORD-BINET INTELLIGENCE SCALE, FORM L-M


ROBERT J. ARMSTRONG 
Salem State College
Boston College


INTELLIGENCE or academic aptitude tests play a major role in
student evaluation. In many instances, the situation demands the 
use of an individual test such as the Stanford-Binet Intelligence 
Scale, Form L-M or the Wechsler Intelligence Scale for Children. 
Unfortunately, individual intelligence tests such as these are time 
consuming in that they require an average administration time 
of one hour, and require specialized training to administer and 
score, usually a full semester course. Thus, there is a need for a
valid individual test of mental ability requiring no specialized 
training which could be quickly administered and easily scored. 
Such a test could be used as (a) a screening device; (b) a retesting
device for both individual and group test results; and (c)
a substitute instrument to test students who missed the group test 
administered to a class. 

In 1963, Richard L. Slosson constructed the Slosson Intelligence 
Test (SIT), sometimes referred to as the Short Intelligence Test. 
The purpose of its author was to construct an abbreviated form 
of the Stanford-Binet Intelligence Scale (S-B), Form L-M, which 
could be used as a screening and retesting instrument and thus 
as a device to provide an opportunity for more students to receive
individual intelligence tests as well as more released time
for counselors and other specialized personnel to devote to their 
other responsibilities. 

The SIT is an individual test of intelligence, for both children
and adults, requiring no specialized training, which takes only
15 to 20 minutes to administer and score. A test-retest reliability
coefficient of .97 and a standard error of measurement of 4.3 IQ
score points are reported by Slosson for 139 subjects ranging from
4 to 50 years old (Slosson, 1963).

The S-B was used as the criterion for establishing the concurrent
validity of the SIT. Slosson cited correlation coefficients
between these two tests ranging from .90 through .98 (median
.96) for subjects whose ages were from 4 to 18 and above. Moreover,
he reported an average absolute IQ score difference of 5.2
between the two tests (Slosson, 1963).

This study was undertaken as a further validation of the SIT. 
The S-B was employed as the validity criterion.

Method. The sample consisted of 724 students (ages six to
14) enrolled in 10 public school systems in northeastern
Massachusetts. Each student was administered both an SIT and an S-B
within a two-week period.

The tests were administered by using various combinations of
personnel. The administrators were classified in three ways:
professionals, trainees, and teachers. Professionals were highly
trained and experienced personnel in the field of testing (limited
to 3 in this study). Trainees were part-time graduate students
(mostly teachers) who were enrolled in a course concerned with
administering, scoring, and interpreting the S-B and who had
administered numerous S-B's under supervision before administering
any tests in this study. The Teachers involved had no knowledge
concerning the administration of the S-B. The use of personnel
in administering the tests was accomplished as follows:


1. 491 subjects were administered both tests by the same
administrator;

2. 233 subjects were administered the tests by two different
administrators;

3. 304 subjects were administered both tests by a professional,
but not necessarily the same administrator;

4. 368 subjects were administered both tests by a trainee, but
not necessarily the same administrator;

5. 52 subjects were administered one test by a professional and
one by a teacher; neither the teacher nor the professional
was aware that the subject had been or was going to be
administered the other test.


Pearson product moment correlation coefficients between S-B and 
SIT scores were computed for each of the following 21 categories: 


1. Overall—for all 724 subjects;

2-10. Age Levels—for each age level separately (6-14);

11. Male—for all male subjects;

12. Female—for all female subjects;

13. Same Administrator—for both tests;

14. Different Administrator—for each test;

15. Professional Administrator—for both tests, but not necessarily
the same administrator;

16. Trainee Administrator—for both tests, but not necessarily
the same administrator;

17. Professional and Teacher Administrators—S-B by a professional
and the SIT by a teacher; neither was aware the
other test was given or was going to be given;

18. S-B Administered First;

19. SIT Administered First;

20. Same Day—when both tests were administered on the same
day;

21. Different Day—when the administration of the second test
was completed between one and 14 days after the first
administration.


Also, the mean absolute difference between IQ scores from the 
two scales was computed for each of the preceding categories. 
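
The two statistics computed for each category can be sketched as follows. The scores below are hypothetical values chosen for illustration, not data from this study, and the function names are likewise illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_absolute_difference(x, y):
    """Average of |x_i - y_i| over paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Hypothetical S-B and SIT IQs for six subjects (not study data).
sb_iq = [92, 101, 110, 118, 87, 125]
sit_iq = [95, 99, 114, 121, 90, 119]

print(round(pearson_r(sb_iq, sit_iq), 2))
print(mean_absolute_difference(sb_iq, sit_iq))
```

The two indices answer different questions: the correlation reflects how similarly the tests rank the subjects, while the mean absolute difference reflects how far apart the two IQs for the same subject tend to be.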

Results and Discussion. Table 1 shows that the overall
Pearson product moment correlation between the S-B and the SIT
was .92. Also, the range of correlations for the nine Age Levels
(6-14) was from .90 (ages 6 and 8) to .95 (age 13). All correlations
were significant beyond the .001 probability level. These
correlations approximate those reported by Slosson (1963) for these
age levels (.94 to .98).

Table 2 presents the findings concerning additional categories 
which were either not reported or not
Time Lapse between the first and second administration. The
range of correlations within these 11 sub-categories was from .92 
(Both Trainees) to .94 (Professionals and Teachers). All correla- 
tions were significant beyond the .001 probability level. 

The standard deviations within both Tables 1 and 2, which are 
large when compared to the standard deviation of 16 for the 
S-B itself, have perhaps produced somewhat inflated correlation 
coefficients. However, what is more important than their size, is the 
correspondence of the S-B and SIT standard deviations. 

Since Slosson’s purpose was to construct an abbreviated form 
of the S-B, these high validity coefficients (Tables 1 and 2) are, 
in a sense, reliability coefficients. At any rate, the fact that the 
correlation between the S-B and the SIT approximates the reli- 
ability of the S-B itself, indicates that the tests appear to be 
measuring similar constructs. 

Table 1 also reveals that the Overall average absolute IQ score
point difference between the two tests was 5.46, which is approxi- 
mately the same as the standard error of measurement of five IQ 
score points for the S-B. The range of average absolute IQ score 
point differences for the 20 subcategories (Tables 1 and 2) was 
from 4.03 (Professionals and Teachers) to 5.88 (Age 9). These 
average absolute score point differences present further evidence 
of the comparability of the scores of the two tests. 


TABLE 1

Cross-Validation of Slosson’s Findings: Means, Standard Deviations,
Correlations and Average Differences of IQ Scores of the
Stanford-Binet Intelligence Scale, Form L-M,
and the Slosson Intelligence Test

                          Mean         Standard Deviation           Average Absolute
Category      N       S-B      SIT       S-B      SIT       r*       IQ Difference
Overall      724    106.44   107.28     18.20    19.23     .92           5.46
Age 6         77    102.52   102.88     17.36    16.73     .90           5.84
Age 7         88    109.01   109.80     18.19    18.64     .92           5.42
Age 8         85    110.67   112.26     16.57    17.48     .90           5.75
Age 9         83    104.83   106.45     16.45    18.60     .91           5.88
Age 10        87    103.48   104.20     18.64    18.81     .92           5.52
Age 11        80    104.25   105.00     18.99    20.62     .94           5.28
Age 12        72    107.47   108.83     18.05    19.95     .93           5.28
Age 13        80    108.38   108.83     18.60    20.58     .95           4.68
Age 14        72    107.13   106.88     19.12    20.11     .93           5.61

* p < .001 (All correlation coefficients are significant beyond the .001 level).




ARMSTRONG AND JENSEN 467 


TABLE 2
Additional Findings of This Study: Means, Standard Deviations, Correlations
and Average Differences of IQ Scores of the Stanford-Binet Intelligence
Scale, Form L-M, and the Slosson Intelligence Test

                                                     Mean        Standard Deviation         Average Absolute
Category                                    N     S-B     SIT      S-B      SIT      r*      IQ Difference
Male                                       379  106.00  106.82    18.81    19.77    .93          5.20
Female                                     345  106.92  107.78    17.49    18.72    .91          5.77
Same Administrator                         491  108.24  109.37    18.87    18.96    .93          5.50
Different Administrator                    233  102.74  102.87    16.03    19.19    .92
Admin. Status Both Professionals           304  100.90  101.30    15.24    18.58    .93          5.48
Admin. Status Both Trainees                368  110.57  112.02    19.21    18.66    .92
Admin. Status Professionals and Teachers    52  109.52  108.65    18.20    17.92    .94          4.03
Order S-B First                            440  106.50  107.96    17.34    18.45    .93          5.68
Order SIT First                            284  106.33  106.22    19.44    20.46    .93          5.68
2nd Administration Same Day                401  104.09  105.75    17.68    18.22                 5.14
2nd Administration 1-14 Days               253  111.34  113.64    18.52    19.82    .93          5.0

* p < .001 (All correlation coefficients are significant beyond the .001 level).


Of particular interest were the results obtained when the S-B
was administered by a Professional and the SIT by a Teacher
(Table 2). In that situation, neither administrator was aware that 
the subject was going to be administered two tests. It should also 
be pointed out that the Teachers had no knowledge concerning 
how to administer or score the S-B. Yet, the correlation between 
the two tests was .94 and the average absolute IQ difference was 
4.03. 

Thus, the findings suggest that the SIT can be used as a valid
screening and retesting substitute for the S-B and provide: (a)
an opportunity for more individual intelligence tests to be given;
(b) a source of additional test administrators; and (c) more time
for specialized personnel to devote to their other responsibilities.


THE RELATIONSHIPS AMONG CLOZE, ASSOCIATIONAL
FLUENCY, AND THE REMOVAL OF INFORMATION 
PROCEDURE—A VALIDITY STUDY 


J. JAAP TUINMAN 


Institute for Child Study 
Indiana University 


In a cloze test the subjects are asked to fill in missing words in 
continuous prose. Commonly every fifth word is deleted and a 
response is scored correct if it matches the word which originally 
appeared in the passage (Taylor, 1953, 1954, 1957). 

The Removal of Information Procedure (RIP) employs a reverse
strategy: subjects are asked to delete that word in every
five-word segment that has the lowest probability of being filled 
in correctly by other subjects presented with the mutilated pas- 
sage. For each word in a RIP-task the probability that it would 
be guessed correctly if deleted is calculated a priori. This is done 
by presenting the RIP-passage as five subsequent 20 per cent 
cloze tasks to different groups of subjects. Thus, for each word 
in the passage the probability of a correct fill-in is known. A
student’s RIP-score is usually obtained by calculating the mean
fill-in probability of the specific words deleted by him (Tuinman,
1970-1971).
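
The two scoring rules just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the passage, the fill-in probabilities, and the function names are hypothetical, and exact-match scoring is used for the cloze task as described above.

```python
def make_cloze(words, every=5):
    """Delete every `every`-th word; return the mutilated passage and the deleted words."""
    mutilated, deleted = [], []
    for position, word in enumerate(words, start=1):
        if position % every == 0:
            mutilated.append("_____")
            deleted.append(word)
        else:
            mutilated.append(word)
    return mutilated, deleted


def score_cloze(responses, deleted):
    """Count responses that exactly match the originally deleted words."""
    return sum(1 for given, original in zip(responses, deleted) if given == original)


def rip_score(fill_in_probabilities, chosen_positions, segment=5):
    """Mean fill-in probability of the words a subject chose to delete.

    chosen_positions holds one 0-based word index per consecutive
    five-word segment of the passage.
    """
    for segment_number, position in enumerate(chosen_positions):
        start = segment_number * segment
        if not start <= position < start + segment:
            raise ValueError("each choice must fall inside its own segment")
    total = sum(fill_in_probabilities[p] for p in chosen_positions)
    return total / len(chosen_positions)


# Hypothetical ten-word passage and a priori fill-in probabilities.
words = "the quick brown fox jumps over the lazy dog again".split()
mutilated, deleted = make_cloze(words)
probabilities = [0.9, 0.2, 0.3, 0.4, 0.1, 0.8, 0.9, 0.5, 0.6, 0.2]

print(" ".join(mutilated))
print(score_cloze(["jumps", "then"], deleted))
print(rip_score(probabilities, [4, 9]))
```

Under this scoring a lower RIP-score indicates a better performance, since the task is to pick the word in each segment that others are least likely to restore.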

The question has been raised as to the mechanisms underlying 
cloze performance. Ohnmacht, Weaver, and Kohler (1970) postulated
a relationship between Cloze and Associational Fluency as
measured by the Controlled Associations Test (FA-1) and by
Associations IV (FA-3) (French, Ekstrom, and Price, 1963). These
authors reported correlations between FA-1 and various cloze-forms
in the .37 to .50 range and between FA-3 and these cloze tests in
the .36 to .44 range.

More recently Byrne, Feldhusen, and Kane (1971) reported a
weak relationship between FA-1 and a regular every fifth word
cloze test: “the difference between high and low groups in
associational fluency was nearly significant [p. 390].”

In the present study the relationship of associational fluency to
both performance on a cloze task and performance on the reverse 
task, RIP, was studied. The subjects were 45 graduate students 
in education. Data on four participants were lost because of their 
absence during one of the testing sessions. The tests were administered
on three separate days, a week apart, in the following order:
RIP, Cloze, FA-1, and FA-3. 

Table 1 contains the descriptive data for all tests. Plots of the 
frequency distribution of scores indicated a rather pronounced 
negative skew for the Cloze variable and a positive skew for the 
FA-3 variable. For that reason both Pearson product moment cor- 
relations and Kendall’s Tau Coefficients are presented in Table 2. 
The only relevant conclusion permissible from the data seems to
be that in this study only one of the associational fluency tasks,
FA-3, has a linear correlation with the cloze task. (No curvilinear
relationships were indicated by the scattergrams.) The magnitude
of this relationship is small, as in the other studies mentioned
above.


TABLE 1
Means, SDs, Spearman-Brown Reliabilities and SEm’s for all Variables

Test        Mean       SD     Reliability     SEm
RIP         .188     .032         .67         .018
Cloze      32.56     4.06         .69         2.23
FA-1       21.85     8.57         .78         4.03
FA-3       10.63     3.87         .51         2.71


TABLE 2
Pearson Product Moment Correlations and Kendall Tau Coefficients
(in Parentheses) (N = 41)

[Correlation matrix for (1) RIP, (2) Cloze, (3) FA-1, and (4) FA-3;
the entries are illegible in the source.]

* p < .05.

The RIP task performance seems not at all related to facility 
on the association tasks. In addition, a nonchance correlation 
exists between RIP and Cloze, but it is almost negligible. Although 
this observation is in accord with previous findings (Tuinman, in 
press), it is still an unexpected phenomenon which deserves further 
study. It may be mentioned that correction for attenuation basi- 
cally does not affect the relationship among the variables as de- 
scribed above. 
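
Both the SEm column of Table 1 and the correction for attenuation follow from the reliabilities. The sketch below is an illustration, not the author's computation: it reads the Table 1 reliabilities as .67, .69, .78, and .51, and the observed correlation of .30 is a hypothetical value, since the entries of Table 2 are not used here.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEm = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores: r_xy / sqrt(r_xx * r_yy)."""
    return r_xy / math.sqrt(r_xx * r_yy)


# (SD, reliability) pairs as read from Table 1.
table_1 = {
    "RIP": (0.032, 0.67),
    "Cloze": (4.06, 0.69),
    "FA-1": (8.57, 0.78),
    "FA-3": (3.87, 0.51),
}

for test, (sd, reliability) in table_1.items():
    print(test, round(standard_error_of_measurement(sd, reliability), 3))

# Hypothetical observed Cloze and FA-3 correlation, for illustration only.
observed_r = 0.30
corrected_r = correct_for_attenuation(
    observed_r, table_1["Cloze"][1], table_1["FA-3"][1]
)
print(round(corrected_r, 2))
```

The computed values agree closely with the tabled SEm's for RIP, FA-1, and FA-3 and differ slightly for Cloze, presumably through rounding of the reported reliability. A correction of this kind raises every coefficient, but it cannot turn a near-zero correlation into a substantial one, which is consistent with the statement above.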

Discussion. Cloze tasks are widely used both for determining 
difficulty levels of reading material and for the measurement of 
reading comprehension. Relatively little is known, however, about 
construct validity aspects of the task. For that reason, studies at- 
tempting to discover the relationship of Cloze to identifiable psy- 
chological factors, are useful. It is not surprising that such studies 
have involved markers of associational fluency. It appears that a 
student's success on a cloze test, and on a RIP test for that
matter, should be related to his ability to produce options from
which he can choose. (In the case of the RIP task such behavior 
is postulated as an explanation of why subjects do decline to delete 
particular words.) From the data in this study and in those re- 
viewed, it appears that success on either task is only vaguely related 
to the kind of associational fluency tapped by the markers employed. 
It must be borne in mind, however, that scores on these tests represent
the number of acceptable responses on a highly controlled task.
Before the hypothesis that Cloze and RIP performances are related
to fluency in the production of associations only in a rather trivial
sense can be substantiated, studies are needed where the latter
variable (fluency) is redefined, say, in terms of quantity of responses
on a free association task.


MUSICAL ABILITY AND THE DRAKE
MUSIC MEMORY TEST¹


LAWRENCE R. GRIFFIN AND RUSSELL EISENMAN²
Temple University 


THE manual for the Drake Music Memory Test reveals that
the test was designed to provide measures of musical aptitude and 
evidence regarding an individual's potential for a successful career 
in music. The manual also indicates that “Drake scores are highly 
predictive of success in music courses and music schools. . . .” The 
manual, however, does not report on studies performed to sub- 
stantiate these comments. 

The present study was carried out in an attempt to provide 
validity data for the instrument, using nonmusicians, music stu- 
dents and successful musicians. It was hypothesized that (a) Drake 
Music Memory Test scores (Form A) would be positively related 
to grades received in music courses, (b) professional musicians 
would have a significantly higher mean score than would music 
students, and (c) music students would have a significantly higher
mean score than nonmusicians. 

Method—Subjects. One hundred thirty-seven subjects were 
divided into three groups: 

Nonmusicians—Sixty-seven graduate and undergraduate students
tested in psychology classes. Thirty-nine were females and
28 were males, their ages ranging from 20 to 40 years.

Music Students—Forty-six music majors from Mastbaum Tech- 
nical High School in Philadelphia, Pennsylvania were tested. These 
students had indicated a desire for a career in music.

¹ Address requests for reprints to Lawrence R. Griffin, Department of
Psychology, Box 393, Temple University, Philadelphia, Pa. 19122.

² Appreciation is extended to Leona Aiken and Jay B. Efran for their
assistance in the data analysis and manuscript preparation.

Professional Musicians—Twenty-four members of the American
Federation of Musicians (AFL-CIO), who were self-supporting
during the previous three year period through musical performance,
were tested individually in various nondisruptive settings convenient
to both the subject and experimenter. Approximately one-half of
the group were members of nationally known musical groups such
as the Glenn Miller Orchestra, the Ray Anthony Orchestra, and the
Woody Herman Orchestra. They ranged in age from 20 to 40 years,
with one female and 23 males in the group.

Materials and procedure. The subjects were administered Form
A of the Drake Music Memory Test. This portion presented a
two-bar melody which the subject compared from memory with
other versions of the melody. He indicated whether he heard a
change and, if so, what kind of change had taken place. The
test had objective scoring with raw scores converted into
percentile ranks. The items were originally based on an analysis of
skills shown by successful performers in various fields of music.
Items had been retained in terms of their correlation with other
available tests and instructors’ ranking of students’ “innate musical
ability” (Drake, 1933).

Results. The raw score means and standard deviations and the
percentile score medians and semi-interquartile ranges are presented
in Table 1. For the raw scores (number right) a one-way
analysis of variance yields an F of 5.27, which is significant
(p < .01).

Tests for the difference between all possible pairs of means were
made through the use of the Newman-Keuls procedure. The professional
musician group obtained a significantly higher mean
score (p < .05) than did the music majors and a significantly
higher mean score (p < .01) than did the nonmusicians. The
difference in mean scores between music majors and nonmusicians
was not significant. The ordered means suggest increasing ability
from nonmusicians through high school music majors to professional
musicians.


TABLE 1
Means, Standard Deviations, Ranges, Medians, Quartiles and Semi-Interquartile
Ranges for All Groups

                       Raw Scores                    Percentile
                              Standard
                   Means     Deviations     Ranges    Medians    Q1     Q3
Nonmusicians       30.96        6.98         1-98        31      15     62
Music Students     32.58        9.48        18-97        80      54     90
Professionals      36.63        6.59         5-93        36      16     53

The percentile norms for the Drake were devised to correct for
musical training and age; they are given for ages seven through
23 plus, for both “music students” (those with more than five years
of training) and “nonmusic students” (those with less than five 
years of training). 

Table 1 lists the raw score means and standard deviations and 
the percentile score medians and semi-interquartile ranges for 
all three groups. It should be noted that the median for the music 
majors is higher than the median for the professionals, even though 
their mean raw scores are significantly different in the reverse 
direction. This outcome suggests that the percentiles overcorrect 
for age and thus yield misleading results. An older professional 
group was not used in the original standardization. This reversal 
when using the percentiles for older musicians has probably not 
been previously noted. 

The predicted positive correlations between performance in 
music courses and Drake raw scores were all significant at the
.01 level. The highest correlations were between music course
grades and Drake raw scores; .58 and .50 for band and theory 
courses respectively. 

Discussion. The stated hypotheses (b) and (c) were confirmed: 
the professional musicians did score significantly higher (in terms 
of raw scores) than either the music students or nonmusicians. 

The Drake test manual also states that “Drake scores are highly
predictive of success in music courses and music schools (Science
Research Associates, 1954, p. 10).” This statement was tested as
a hypothesis and confirmed. For the music major group, the scores
were significantly correlated with school grades in both band and
music theory. These results show that the test should be able to
discriminate between the poorest and strongest prospects for success
in profiting from musical instruction, although it may not be
particularly useful in individual counseling. The test does appear
to be useful as a gross indicator of success in music courses, particularly
when the raw scores are used and interpreted in a general
way.

Percentile norms for Form A are presented in the manual for 
music students for each form and section of the test with the total 
number of cases equal to 5894 for the memory section. No norms 
are given for professional musicians, or for other special groups. 
The results of this study suggest that for older subjects the use 
of the percentile norms is misleading in that subjects with superior 
ability are scored as inferior because of overcorrection for
age at the upper end of the scales. Raw scores as compared with 
percentile scores yield more predictive estimates. Specialized per- 
centile norms can also be developed by individual users. 

Summary. The purpose of this study was to determine whether 
scores on the Drake Music Memory Test (Form A) predict suc- 
cess in music courses and to ascertain whether they render evidence 
of an individual’s potential for a music career. There were three
groups of subjects including 67 nonmusicians, 46 music majors, 
and 24 professional musicians. The results indicated that the 
professional musician group did have a higher mean raw score than 
did the music student and nonmusician groups. It was shown that 
the Drake scores were positively related to grades received in 
music courses for the music student group. 


REFERENCES 
Drake, R. M. Four new tests of musical talent. Journal of Applied 
Psychology, 1933, 17, 129-142. 


Science Research Associates, Manual the Drake Musical Apti-


A VALIDITY STUDY OF THE ACQUIESCENCE SCALE
OF THE HOLLAND VOCATIONAL 
PREFERENCE INVENTORY 


S. S. JACOBS
University of Pittsburgh 


“RESPONSE style” is a term which has been variously and inconsistently
defined. In agreement with distinctions drawn by
Rorer (1965), this note will consider some of the difficulties involved
in the assessment of an acquiescent response style: the
tendency to agree with or accept items, particularly of the
forced-choice or true-false variety, without regard to content.

The Holland Vocational Preference Inventory (VPI) (Holland,
1965), a personality inventory composed entirely of occupational
titles, contains an Acquiescence (Ac) Scale, which is an attempt
at the standardized measurement of this variable.

It appears that the VPI offers a scale which could produce data
usable as a covariate or a “leveling” variable in research involving
data possibly contaminated by Ss’ acquiescent tendencies.

However, even a cursory examination of the literature concern- 
ing the problem of response styles, sets, and biases suggests the 
solution is not so straightforward as it might appear. The most 
consistent results seem to indicate that the antecedents and correlates 
of acquiescence are only partially understood; Rorer, in 
discussing the problem of identifying sources of variance in MMPI 
scales intended to measure acquiescence, concluded: “All are subject 
to the same equivocal interpretation as regards the relative 
importance of set, style, and content in accounting for scale scores 
(p. 141).” It is the position of this note that the same comment 
applies to Ac scale scores. 

The study was designed to examine the validity of the Ac 
scale scores by examining the effect of specific test characteristics 
on a theoretically general aspect of personality. These character- 
istics were: (a) wording of the VPI directions, (b) format of the 
response blank, and (c) inventory content. 

Method. Subjects were 40 male and 49 female students enrolled 
in an urban high school, constituting the enrollment in five sections 
of an elective course in introductory psychology, predominantly 
in their junior or senior years. 

All Ss were administered the entire VPI, using the standard 
procedure outlined in the Manual. Following this, the Ss partici- 
pated in a mock extra-sensory perception (ESP) experiment as 
follows: 

Subjects were told by E that a randomly selected VPI response 
blank was contained in a manila folder held by E. Actually, the 
envelope held a blank sheet of paper. Subjects were instructed 
to concentrate as E called off item numbers one through 30 at 
three-second intervals, and to record the first response (“Y” for 
“Yes,” “N” for “No,” or blank) that occurred to them. 

The available evidence indicates that this procedure probably 
yields the most valid assessment of a style (Rorer, 1965, p. 151). 
In conventional terms, the concurrent validity of Ac scores was 
determined; the Ac score is simply the total of “Y” responses to 
the first 30 VPI items. ESP acquiescence scores were similarly 
defined. 
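
The scoring rule just described can be sketched in a few lines. This is only an illustration: the function name and the sample answer sheet are hypothetical, and the only thing taken from the text is that the score is the count of “Y” responses among the first 30 items.

```python
def acquiescence_score(responses, n_items=30):
    """Count "Y" responses among the first n_items entries.

    responses: sequence of "Y", "N", or "" (blank), in item order.
    """
    return sum(1 for r in responses[:n_items] if r == "Y")


# Hypothetical answer sheet: 12 "Yes", 10 "No", 8 blank, then later items
# that fall outside the 30 scored items.
sheet = ["Y"] * 12 + ["N"] * 10 + [""] * 8 + ["Y"] * 5
print(acquiescence_score(sheet))
```

The same function would serve for the mock ESP responses, since those scores were defined in the same way.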

One week later, unannounced, the second phase of the study was 
carried out. A quasi-random assignment of Ss to treatments was 
employed; since the verbal directions were alike for pairs of treat- 
ments, two classes were randomly assigned to one treatment-pair, 
the remaining three classes to the other treatment-pair. Subjects 
within classes were randomly assigned to the treatment-pair as- 
signed to that class. Since the randomization procedure is still 
“lumpy,” as opposed to complete random assignment of Ss, the 
analysis of covariance was used to increase precision, with Ss' 
original Ac scores as covariate. 

The four treatments were as follows: 


1. retesting with conventional VPI materials, hereafter referred 
to as the control group or group SS; verbal directions 
were read from a copy of a conventional VPI test blank; 

2. retesting with a conventional VPI test blank, with a modified 
response sheet, hereafter referred to as group SM; verbal 
directions same as (1) above; 

3. retesting with a modified VPI test blank, with a conventional 
response sheet, hereafter referred to as group MS; verbal 
directions were read from a copy of the modified VPI test 
blank; 

4. retesting with a modified VPI test blank, with a modified 
VPI response sheet, hereafter referred to as group MM; verbal 
directions as in (3) above. 


The modification to the response sheet consisted of reordering 
the positions of the “Y” and “N” options randomly, and rewording 
the directions “Blacken Y for ‘Yes,’ or N for ‘No’” to 
read “Blacken N for ‘No,’ or Y for ‘Yes.’” 

The modifications to the test booklet consisted of reordering the 
conventional directions which read: 


1. Show on your answer sheet the occupations which interest 
or appeal to you by blackening Y for “Yes”; 

2. Blacken N for “No” for the occupations you dislike or find 
uninteresting; 

3. Make no marks when you are undecided about an occupation; 

so that the positions of the first two sentences (1. and 2.) were reversed. 


The modifications employed are an adaptation of the “reversal 
approach” employed by Rorer and Goldberg (1965) and Chapman 
and Bock (1958). 

Results. An analysis of covariance on the mean Ac scale scores 
of the four groups resulted in an F of 3.71 (p < .025). Scheffé’s 
test on the differences among the adjusted means indicated group 
MM was significantly different from groups MS, SM, and SS, which 
were not significantly different from each other. (See Table 1.) 

To estimate the reliability of the content-free ESP measure of 
acquiescence, two randomly selected classes (N = 48), were read- 
ministered the mock ESP experiment, following the collection of 
second-phase data. The test-retest reliability was .75, indicating 
a stable tendency to respond in a particular manner in the absence 
of content. The correlation between initial VPI Ac scores and the 
initial ESP acquiescence scores was found to be .15, not signifi- 
cantly different from zero. 
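
As a rough check on the statement that a correlation of .15 is not significantly different from zero, the usual t test for a product-moment correlation can be worked through. The sample size of 89 is an assumption here (all 40 male and 49 female Ss); the text does not state the N on which this particular correlation was based.

```python
import math


def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


# r = .15 is taken from the text; n = 89 is assumed, not reported.
t = t_for_r(0.15, 89)
print(round(t, 2))
```

With 87 degrees of freedom the two-tailed .05 critical value of t is roughly 1.99, so a t of about 1.4 falls short of significance, in line with the text.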
TABLE 1 
Result of Scheffé’s Test on Adjusted Means 

                     Experimental Group 
SS (N = 14)    SM (N = 19)    MS (N = 27)    MM (N = 29) 
   10.0           10.5            9.9            8.0 
   ______________________________________ 

Note. Underscored adjusted means are not significantly different (p > .05). 


Conclusion. It seems quite evident that, rather than tapping a 
generalized response style of acquiescence, VPI Ac scale 
responses are strongly moderated by the combination of the im- 
plications and influences present in the VPI directions and re- 
sponse sheet format. It is also discouraging to note that apparently 
Ac scale scores are “content-bound”; that is, it appears they 
largely reflect the influence of the occupational titles rather than 
any generalized tendency toward agreement. 

In conclusion, the use of Ac scale scores as a covariate or 
stratifying variable may be defensible within the context of the 
total VPI, since it may be argued that all the scales are similarly 
affected by the directions, test format, and acquiescent tendencies, 
if such exist. However, the validity of the data is questionable within 
the context of the VPI, and certainly highly suspect if used in re- 
search where a covariate or stratifying variable tapping general 
acquiescent tendencies is needed. 

Summary. The results of this study indicate that VPI Ac scale 
scores are suspect as indicators of the response style of acquies- 
cence. It appears cues and implications embedded in the inven- 
tory’s directions, format, and content exert a significant influence 
on measured acquiescent tendencies. 
A COMPARISON OF THE SELF-REPORT AND THE 
OBSERVED REPORT IN THE MEASUREMENT OF 
THE SELF-CONCEPT: IMPLICATIONS FOR 
CONSTRUCT VALIDITY¹ 


JOAN J. MICHAEL 
California State College, Long Beach 
ALEXIA PLASS 
Huntington Beach Union High School District 
YOUNG B. LEE 
University of Southern California 


Problem. This investigation was primarily concerned with a 
comparison of two methods of measuring the complex self-concept 
construct: (1) the self-report of students (SR) and (2) the recorded 
perceptions of trained observers (OR). Attention was specifically 
focused upon a comparison of the SR of 30 sixth-grade pupils 
with the OR of two of their teachers with respect to each of the 
48 items on a self-concept scale adapted from Coopersmith’s (1967) 
Self-Esteem Inventory and with respect to each of four scales 
representing four hypothesized subconstructs: (a) mental health, 
(b) personal self, (c) academic self, and (d) social self. A lack of 
congruence between the SR and OR measures would imply that 
the interpretability and validity of constructs in the affective 
behavior evaluated by a self-concept scale would depend upon the 
mode of measurement and the sources of data. Answers to the 
following questions were sought: 

1. What differences if any existed between SR measures and OR 
measures for each of the four 12-item scales: (a) mental health, 
(b) personal self, (c) academic self, and (d) social self? 


1 Paper presented at the annual meeting of the National Council on Meas- 
urement in Education in New York City, February 1971. 




2. What differences if any existed between SR measures and OR 
measures for each of the 12 items within each of the four scales 
hypothesized to represent the constructs of: (a) mental health, 
(b) personal self, (c) academic self, and (d) social self? 


3. What differences if any existed between SR and OR data in 
clusterings or typologies on each of the four hypothesized constructs 
for 30 students, who were intercorrelated over 48 items? 

Methodology. The responses of 30 pupils and their two teachers 
to the 48-item scale, which comprised four subscales, were 
subjected to analyses of variance, and post hoc tests when appropriate, 
to determine whether there were differences in proportions 
of responses. Finally, a Q factor analysis was executed for 
each of the three matrices of interrelations. 

Results. The extensive data for this study (which are available 
upon request from the first mentioned writer) yielded the following 
major findings: 

1. Differences existed between SR and OR measures for the sub- 
scales of mental health, personal self, and social self, but not for 
academic self. 

2. Differences existed between SR and OR measures for nine of 
the items: (a) two, for mental health, (b) two, for personal self, 
(c) none, for academic self, and (d) five, for social self. 

3. Different factor patterns existed between SR and OR data 
in clusterings on each of the four hypothesized constructs. 

Conclusions. The lack of congruence between SR and OR measures, 
especially on the hypothesized social self factor and to a 
lesser extent on the hypothesized mental health and personal self 
factors, suggested that evaluations afforded by a self-concept scale 
would differ in these areas depending upon the mode of measure- 
ment. However, it would seem that any measurement by a self- 
concept scale in the academic domain in a school setting could be 
expected to elicit comparable results regardless of whether the 
student or the teacher-observer did the measurement. 

Discussion. It would appear that in the development of tests 
of self-concept, attention needs to be given to whether an outside 
observer, such as a teacher or psychologist, will be making the 
evaluation or whether the individual student himself will be furnishing 
the data. In any event, in those more subjectively oriented 
areas where the teacher or psychologist has a limited opportunity 
to observe and has a highly curtailed personal relationship with 
the student, quite different constructs may be expected to be 
present. Thus, it seems quite apparent that any test publisher 
who expects to furnish construct validity for his test should report 
separate data relative to the method by which the data were ob- 
tained. One can hardly expect in the instance of adult observers 
and child respondents that the Campbell and Fiske (1959) mul- 
titrait-multimethod validation would yield similar constructs for 
the self-perceptions of students or for the perceptions that teachers 
hold regarding their students. 

Additional support for many of these statements was also forth- 
coming from the Q factor analyses that failed to yield either 
interpretable or matchable constructs either between self-report 
data and observed report data or between two sets of observed 
report data. Admittedly, the failure of the Q analyses could be 
due in part to difficulties of communality estimation or to the 
operation of unidentifiable sources of bias as in response sets. 

In any event, the statistical analyses of the data strongly in- 
dicated that the self-concept is a complex entity made up of many 
constructs, the validity of which is dependent upon the measure- 
ment procedure. Thus, support has been obtained for the Cooper- 
smith argument (1967) for a combination of observer evaluations. 
OVERALL MEASURES OF SELF-ACTUALIZATION 
DERIVED FROM THE PERSONAL ORIENTATION 
INVENTORY: A REPLICATION AND 
REFINEMENT STUDY 


VERNON J. DAMM 
University of Portland 


Using a sample of 95 male and 113 female high school subjects, 
Damm (1969) showed that an overall measure of self-actualization, 
which was defined as a total score on the Personal Orientation 
Inventory (POI) (Shostrom, 1966), could be obtained either 
by using the Inner Directed (I) scale singly, or by combining 
the raw scores of the Time Competent (Tc) and I scales. No 
additional advantage was found by converting raw score scale 
distributions to standard score scale distributions for combining 
subscales. 

That study was replicated on samples from older populations, 
and comparisons between males and females were made. Thus, the 
purpose of this investigation was to report for new samples the 
intercorrelations among selected subscales of the POI with the 
view of determining the feasibility of substituting one or two 
scales for the overall scale (composite of 12 scales) of the POI. 

Method. Three additional populations sampled were: 205 male 
and 206 female Oregon State University freshman students 
enrolled in a Personality and Development Course; 139 female 
nursing students in the sophomore, junior, and senior classes of 
the University of San Francisco and University of Portland 
Schools of Nursing; and 75 male and 31 female adult students 
enrolled in a lower division class of the Division of Continuing 
Education, Portland, Oregon. 

Four methods for deriving an overall measure of the defined 
instrument of self-actualization were applied to the separate sexes 
of the separate samples. They were: 

1. Standard Score: Average Overall Scale (S:AOS), by converting 
the raw score distributions to standard score distributions for each 
of the 12 scales, then combining the 12 standard scale scores for 
each subject; 

2. Standard Score: Inner Directed-Time Competent Scale 
(S:I-Tc), by combining the standard scores of the I and Tc 
scales for each subject; 

3. Raw Score: Inner Directed-Time Competent Scale (R:I-Tc), 
by combining the raw scores of the I and Tc scales for each subject; 

4. Raw Score: Inner Directed Scale (R:I), by using the raw score 
of the I scale for each subject. 
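
The four derivations listed above can be sketched as follows. Several points are assumptions rather than details given in the text: “combining” is read as summing for the two-scale composites and as averaging for the Average Overall Scale, standard scores are computed as z scores, and the toy data use three hypothetical scales and four hypothetical subjects instead of the 12 POI scales.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Convert one scale's raw scores to standard (z) scores."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composites(raw):
    """raw: dict mapping scale name -> list of raw scores, one per subject.

    Must contain "I" and "Tc". Returns the four overall measures.
    """
    std = {name: z_scores(vals) for name, vals in raw.items()}
    n = len(raw["I"])
    return {
        # Assumption: the overall scale averages the standardized scales.
        "S:AOS": [mean(std[name][i] for name in std) for i in range(n)],
        # Assumption: "combining" two scales means adding their scores.
        "S:I-Tc": [std["I"][i] + std["Tc"][i] for i in range(n)],
        "R:I-Tc": [raw["I"][i] + raw["Tc"][i] for i in range(n)],
        "R:I": list(raw["I"]),
    }


# Hypothetical raw scores for four subjects on three scales.
raw = {"I": [80, 90, 70, 100], "Tc": [15, 18, 12, 20], "Ex": [20, 22, 18, 25]}
scores = composites(raw)
print(scores["R:I-Tc"])
```

Because a z score is a linear transformation of the raw score, standardizing a single scale leaves the ordering of subjects unchanged; only composites of two or more scales can differ between the raw and standard score versions.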

Results. Table 1 contains the product moment correlation coefficients 
of each of these overall scale interscale relationships for each 
of the populations sampled, broken down separately by sex, and 
then combined as a total N. The coefficients of the original sample 
of high school subjects are also included. The large magnitudes of 
the coefficients of correlation are due in part to a substantial overlapping 
of items in the various scales cited. 

For the total sample of subjects (N = 656), the R:I-Tc/R:I 
combination yielded a coefficient of .98, explaining 96 per cent of 
the variance between them; the S:I-Tc/R:I-Tc combination 
showed a coefficient of .96, accounting for 92 per cent of their common 
variance; and the S:I-Tc/R:I combination furnished a 
coefficient of .88, describing only 77 per cent of their common variance. 
Since the R:I scale is identical in ranks to a standard score 
scale of the same values, nothing would be gained by converting 
these raw scores into standard scores. 
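
The percentages of common variance quoted in this paragraph are the squares of the reported coefficients. A short check, using only the three coefficients given in the text (the function name is illustrative):

```python
def shared_variance_pct(r):
    """Percentage of variance two measures share, given their correlation r."""
    return round(100 * r * r)


# Coefficients as reported for the total sample (N = 656).
reported = [("R:I-Tc / R:I", 0.98), ("S:I-Tc / R:I-Tc", 0.96), ("S:I-Tc / R:I", 0.88)]
for pair, r in reported:
    print(pair, shared_variance_pct(r))
```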

For the R:I-Tc and S:I-Tc scales, which are a combination 
of the I and Tc scales, it must be empirically determined whether 
combining the raw scores would afford differential results from those 
obtained in combining the standard scores. It was found that R:I 
and R:I-Tc yielded practically identical predictions and that 
R:I-Tc compared to S:I-Tc accounted for about 92 per cent 
of the variance between them. Thus, the coefficient of .88 for 
S:I-Tc/R:I indicated that combining the standard score scale 
values of the I and Tc scales reduced the magnitude of the relationship 
between them and the R:I scale, an outcome that also held 
in predicting the S:AOS scale, in that both R:I-Tc and R:I 
did predict significantly more accurately at the .01 level (Hendrickson 
and Collins, 1970) to S:AOS than did S:I-Tc. This observation 
was further supported by the fact that R:I-Tc did predict 
significantly more accurately to R:I than did S:I-Tc; R:I-Tc 
to a significantly greater degree to S:I-Tc than did R:I; and R:I 
to a greater extent to R:I-Tc than did S:I-Tc. These findings 
indicated that combined raw score scale values were better predictors 
of the other overall scale values than were combined standard 
score scale values. Furthermore, since the R:I-Tc scale was 
more highly associated with the S:I-Tc and the S:AOS scales 
than was the R:I scale, and since the R:I-Tc scale included all 
the 150 POI items, whereas the R:I scale included only 127 of them, 
it seems advisable to use the R:I-Tc scale rather than any of the 
other overall scales as a single measure of self-actualization. 
The only significantly different coefficients at the .01 level between 
the original high school sample (N = 208) and the new 
total sample of college students were for the S:AOS/R:I-Tc comparison. 
These coefficients were .97 and .95, respectively. The only 
significantly different coefficients for the combined male (N = 280) 
and the combined female (N = 376) samples of the replication 
study were for the S:AOS/R:I comparison. These coefficients were 
.95 and .93, respectively. For all comparisons excluding S:AOS as 
one of the scales, there were no significant differences between the 
combined female coefficients (N = 376) and the combined male 
coefficients (N = 280), or between the coefficients for the original 
high school sample (N = 208) and the total sample (N = 656). 
Conclusions. The conclusion of the original study was that an 
overall measure of the POI could probably be best obtained by 
using either the R:I or the R:I-Tc scale. The present study indicated 
(1) that, among those scales investigated, the R:I-Tc 
scale was the best predictor of an overall measure of the POI, (2) 
that the finding could be generalized beyond the high school population 
to certain college populations, and (3) that this result obtained 
equally for the sexes. 
PREDICTION OF QUALITY POINT AVERAGES FROM 
PERSONALITY VARIABLES 


JERRY B. AYERS AND MICHAEL E. ROHR 
Tennessee Technological University 


THE role that personality plays in predicting the academic 
achievement of college students has been explored by several 
investigators. Cattell, Eber, and Tatsuoka (1970); Holland (1960) ; 
Huckabee (1968); and the IPAT Bulletin (1962) have reported 
studies involving the relationship of certain personality variables 
to grades in courses and overall achievement in college. How- 
ever, there is a dearth of data for certain subject matter areas 
and for students enrolled in regional state universities. 

The present study was designed to examine the validity of 
the Sixteen Personality Factor Questionnaire (16PF) in predict- 
ing the achievement of college sophomores majoring in Education, 
Engineering, and Business at a regional state university and to 
construct a regression equation to predict overall college quality 
point average. 

Procedure. The subjects for this study were 415 students en- 
rolled in the third quarter of their sophomore year at Tennessee 
Technological University and majoring in Education (N = 149), 
Engineering (N = 88), and Business (N = 178). The Sixteen 
Personality Factor Questionnaire, Form A, 1962 Edition, which 
was used to assess personality, was administered by the investigators. 
Quality Point Averages (QPA) were obtained from each 
subject's permanent record. Intercorrelations of all variables were 
obtained, and stepwise regression equations were computed for 
each subgroup and the Total group. 

Results and discussion. The correlations of the variables with 
QPA are found in Table 1. An examination of the personality 
profile for each of the groups showed all scores in the average 
range. Sex was a significant correlate of success for the Total group 
and for the Education and Business subgroups (Sex was coded 
such that male = 1 and female = 2). Female students, in general, 
had higher QPA's than males in these groups. 

TABLE 1 

Correlation of Variables with Overall Quality Point Average for Total Sample 
and for Education, Engineering, and Business Majorsᵃ 

                         Total     Education   Engineering   Business 
Variables              (N = 415)   (N = 149)    (N = 88)     (N = 178) 

Sex                       10*         18*          06            2 

(Personality Factors) 

A                        -03         -06          -10           03 
B                         09          41**         07           01 
C                         03          01           05           03 
E                         02         -02           08          -06 
F                        -03         -06           07          -14* 
G                         11*         23**         10           17* 
H                        -06          02           16           01 
I                        -05          03          -14          -02 
L                        -14**       -16*          03          -17* 
M                         05          04           06          -00 
N                        -01          03          -09          -07 
O                        -05         -31**         20          -09 
Q1                        10*         15           24*          01 
Q2                        00          08           04          -07 
Q3                       -03          07          -16           08 
Q4                       -01         -07           17          -16 

ᵃ Decimals omitted. 
* Significant at the .05 level. 
** Significant at the .01 level. 

The significant personality correlates for the Total group included 
Factors G(.11), L(-.14), and Q1(.10). These results are 
consistent with the results of other studies. High achievers, 


cation. The successful Education student can be described as intelligent 
(B), persistent (G), adaptable (L), and self-secure (O). 
Factor Q1(.24) was the only significant correlate of success for 
Engineering students. Based on this factor the successful student 
can be described as critical, analytical, and experimenting. According 
to Cattell, et al. (1970) this characteristic is evident 
in successful scientists and engineers. Factors F(-.14), G(.17), 
and L(-.17) were significant correlates of success for Business 
students. Therefore, high achieving Business students can be characterized 
as serious (F), persistent (G), and adaptable (L). 

Multiple correlations and standard errors of estimate of 
correlation coefficients appear in Table 2. The first row shows the 
results for stepwise regression equations utilizing only the 16PF 
Factors, while the second row shows results for equations that 
include sex. F ratios for testing the incremental effect of adding 
sex to the 16PF variables were significant at the .01 level for the 
Total group and the Business group. Sex of the students was 
highly related to success in college as measured by QPA. Females 
were found to be more successful than males. The regression 
equation for prediction of QPA for the Total group was as follows 
(16PF factors must be used in terms of Sten Scores) : 


QPA = 3.16 + .18 SEX - .01A + .02B - .04C - .06E - .02F 
      + .04G - .02H - .01I - .05L + .02M - .03O 
      + .03Q1 - .03Q2 - .03Q3 + .01Q4 
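
To show how such an equation would be applied, the sketch below evaluates it for one hypothetical student. The weights are read from the printed equation as well as the scan allows; the weight for Factor I in particular is hard to read in the source and is assumed here to be -.01. Factor N does not appear in the equation. The function name and the example student are illustrative only.

```python
INTERCEPT = 3.16
SEX_WEIGHT = 0.18  # sex coded male = 1, female = 2, as in the text

# Weights for 16PF sten scores; the value for "I" is an assumption.
WEIGHTS = {
    "A": -0.01, "B": 0.02, "C": -0.04, "E": -0.06, "F": -0.02,
    "G": 0.04, "H": -0.02, "I": -0.01, "L": -0.05, "M": 0.02,
    "O": -0.03, "Q1": 0.03, "Q2": -0.03, "Q3": -0.03, "Q4": 0.01,
}


def predict_qpa(sex, stens):
    """Predicted QPA from sex code and a dict of sten scores (1 to 10)."""
    return INTERCEPT + SEX_WEIGHT * sex + sum(
        weight * stens[factor] for factor, weight in WEIGHTS.items()
    )


# Hypothetical student scoring at the sten midpoint (5.5) on every factor.
average_profile = {factor: 5.5 for factor in WEIGHTS}
print(round(predict_qpa(1, average_profile), 2))
print(round(predict_qpa(2, average_profile), 2))
```

Under these assumed weights, the two predictions differ only by the sex weight of .18, which is the sense in which sex adds to the personality factors in the equation.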

The results of this study both support and fail to support 
the validity of the 16PF in academic prediction. The correlation 
patterns of the 16PF for predicting achievement, which describe 
fairly accurately the difference between high- and low-achieving 
students, support the interpretations given by the authors of the 
16PF. However, the equations based only on the 16PF variables 
account for only a limited amount of the total variance in the 
prediction equation. It is the belief of the authors that the 16PF 
should not be used alone in predicting achievement in the academic 
areas. A more valid and reliable prediction equation can be ob- 
tained by a combination of cognitive and personality variables. 
Cross-validation of results on several additional samples is needed 
to establish the stability of the weights for prediction of success in 
samples of individuals pursuing different majors. 


TABLE 2 
Comparison of Multiple Correlations and Standard Errors of Estimate of R* 

                       Total       Education    Engineering    Business 
                      R    SE      R    SE      R    SE        R    SE 
16PF Factors         .50  256     .50  .68     .57  49        .39  .45 
16PF Factors + Sex   58 7 1 .63 .6 45 .48 43 

* In terms of Sten Scores. 


REFERENCES 


of Educational Research, 1968, 2, 221-25. 
IPAT. IPAT Information Bulletin #4. Champaign, Ill.: Institute 
for Personality and Ability Testing, 1962. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1972, 32, 495—497. 


A NOTE ON RESULTANT ACHIEVEMENT MOTIVATION 
MEASURES AND SCHOLASTIC ATTAINMENT 


FRANK H. FARLEY 
University of Wisconsin 


THE concept of resultant achievement motivation (RAM) 
(Atkinson, 1957) as need-to-achieve minus fear-of-failure has 
been influential in recent motivational theory and research. Mehrabian 
(1968, 1969) has recently developed self-report scales 
of RAM for males and females to overcome many drawbacks of 
Thematic Apperception Test (TAT) measures. Where the utility 
of such measures to education is concerned, their relationship to 
scholastic achievement is of central interest. Weiner, Johnson, and 
Mehrabian (1968) have reported no significant relationship of 
scholastic performance (course examination) and Mehrabian's 
RAM scale or a RAM measure utilizing TAT scores with a male 
sample. Raffini (1971) reported similar results with the Mehrabian 
measure on males, but also reported that with females such 
scholastic performance was significantly better in high over low 
RAM subjects. Kestenbaum and Weiner (1970), however, cited a 
significant correlation between a RAM measure for children similar 
to Mehrabian’s adult forms and a standardized reading achievement 
test for males, with high RAM being associated with superior 
achievement, but no significant correlation for female subjects. 
Using Mehrabian’s scale with females the latter result was also 
obtained by Farley and Mealiea (1971). Farley and Truog (1971), 
employing a different measure of RAM, likewise found no sig- 
nificant relationship of RAM and achievement. 

One possible difficulty with the foregoing studies, except for 
those of Farley and Truog (1971) and Kestenbaum and Weiner 
(1970) both of whom used standardized achievement tests, might 
lie in the choice of measure of academic achievement, that is, 
a single final examination score from one course. Such a measure 
may be strongly subject to idiosyncratic influences of course content, 
test format, or test administration (most of the studies 
differed in length and format of the examination). Clearly, a 
superior estimate of scholastic achievement would be the cumula- 
tive grade point average (GPA). 
Purpose and procedure. The purpose of this study was to deter- 
mine the criterion-related validity of RAM (Mehrabian scale) 
relative to a criterion of GPA for each of two samples of 30 fe- 
males and 30 males. The subjects were obtained voluntarily from 
an undergraduate educational psychology class. A product-moment 
correlation between the predictor and criterion variable was 
obtained for each sample. 
Findings. For females the correlation of RAM and GPA was 
-.30 (NS), while the comparable correlation for males was .12 
(NS). Neither correlation was significant. In addition, the correlation 
based on females was opposite in sign to the comparable 
result reported by Raffini (1971). 
Conclusions. On the basis of this analysis, and taking into 
account the preponderantly negative findings reported above, there 
seems no compelling reason to reject the null hypothesis where the 
relationship of the present resultant achievement motivation measures 
to scholastic achievement is concerned. 
Jeremy M. Anglin. The Growth of Word Meaning. Cambridge: 
M.I.T. Press, 1970. Pp. xvi + 108. $5.95. 


This book is a research monograph based primarily on the 
author’s doctoral dissertation at Harvard. It is a compendium of 
seven studies, mainly cross-sectional, involving 20 words. The 
first two were sorting experiments. The third was a free recall 
experiment. The fourth reports the results of a free association 
experiment. The fifth, a frames (4 sentence frames for nouns, 
verbs, adjectives, and prepositions with a missing word in each) 
experiment, was designed to see if children could be lured into 
respecting the part of speech distinction. The sixth experiment 
assessed the capacity of individuals of different ages to extract 
and make use of the relations that exist among the 20 words. 
The last experiment attempted to determine the extent to which 
subjects of various ages could indicate whether pairs of words 
(from the 20) shared communality that made them similar in 
meaning. It is interesting to note that each study continually 
provided issues for the other studies reported in this book in an 
evolving fashion. 

The design of the reported experiments was based on four assumptions 
regarding the nature of words. First, a word is “a container 
of meaning with a primary function in organizing the world 
of experience to make it conceptually manageable.” Inherent in 
this assumption is the emphasis that there is communality between 
words and their features in contrast with Katz’s (1966) notion 
of distinction between words and their semantic markers (features). 
Second, words possess hierarchical relations in such a way that 
subordinate and superordinate categories can be established 
among them. Third, sentences provide a source of verbal concepts. 
The meaning of a word can be deduced from the utterances and 
sentences of a language, and similarity of meaning is systematically 
related to privileges of occurrence. Fourth, “a word is a 
social phenomenon, a part of the culture, and relatively useless 
unless it means the same thing to different speakers of the language.” 
These four assumptions (biases) are reflected in the reported 
studies in the selection of experimental tasks, the construction of 
the set of words, and the analysis of the data. 

The selection of 20 words was based on the author's intuitions 
about word-relations and the desire to obtain a set of words 
such that the shared features could be arranged in a nest (subordinate 
and superordinate hierarchy). The aim was to obtain 
the degree of abstractness of the equivalence relation between 
words (whether any two words share a feature or a set of 
features). The meaning shared by two words was estimated in 
terms of proximity, the number of agreements among respondents. 
This estimate was task dependent, viz., instructions and individual 
differences were possible confounding variables. However, some 
findings were replicated using different instructions. 
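
One way to picture a proximity estimate of this kind is as a count, for each pair of words, of how many respondents placed the two words in the same pile of a sorting task. The sketch below rests on that reading, which is an assumption about how agreements were tallied; the four words and three respondents are hypothetical and are not taken from the book.

```python
from itertools import combinations


def proximity(sorts):
    """Count, for each word pair, the respondents who sorted them together.

    sorts: one entry per respondent, each a list of piles (sets of words).
    Returns a dict mapping frozenset({word1, word2}) -> number of agreements.
    """
    counts = {}
    for piles in sorts:
        for pile in piles:
            for a, b in combinations(sorted(pile), 2):
                key = frozenset((a, b))
                counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sortings of four words by three respondents.
sorts = [
    [{"dog", "horse"}, {"run", "jump"}],
    [{"dog", "run"}, {"horse", "jump"}],
    [{"dog", "horse", "jump"}, {"run"}],
]
prox = proximity(sorts)
print(prox[frozenset(("dog", "horse"))])
```

A matrix of such counts is the kind of input on which clustering and scaling methods operate, and it makes plain why the estimate depends on the task: different sorting instructions produce different piles and therefore different counts.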

Hierarchical cluster analysis was used to depict the appreciation 
of class inclusion relation among words. Multidimensional scaling 
techniques were also used to determine the extent of idiosyncrasies 
among the subjects in tasks involving use of the semantic relation 
among words. 

The results of the studies reported have, by and large, indicated 
that the growth of word meaning follows the concrete-abstract 
progression. Very young children tended to be idiosyncratic in their 
organization of words and whenever there was uniformity in 
their responses it could be ascribed to the thematic principle. 
Adults have tended to be more consistent in grouping words that 
belong to the same conceptual category. Between these two 
extremes there appears to be a gradual syntagmatic-paradigmatic 
shift (responses in free association which are predominantly of 
a different part of speech are called syntagmatic, whereas responses 
which are of the same part of speech are called paradigmatic). 
The generalization hypothesis as opposed to the differentiation 
hypothesis has been borne out by the data. The generalization 
process in this context is characterized by development in children 
by which they first realize the similarity among small 
groups of words and only later appreciate the similarity among 
increasingly large classes. The contrasting position, the differentiation 
process, is viewed as growth in ability to make finer 
and finer discriminations. 


(assumptions) and preconceptions or they are in fact indications
of the growth of word meaning. The obtained data were converted,
in clustering experiments, into distances on a priori theoretical
preconceptions. For example, in the first clustering experiment a
measure of distance between each observed clustering and the
theoretical clustering into four parts of speech was computed for every
adult and for every child. It is possible that the results are simply
an artifact of the chosen theoretical framework. The part of speech
distinction was not clear to most of the children and a few of the
adults, as indicated in a part of the second sorting experiment
when subjects were asked to sort the words according to the part
of speech distinction.

However, the theoretical bias does not mar the quality of this
book. The presentation is lucid, although a few incorrect table
references in the text make it a bit awkward to follow. For example,
Tables 5 and 6 on p. 67 should be Tables 4.1 and 4.2.

This book should have a general appeal to the researchers in 
psycholinguistics and developmental psychology. It also adds to 
our understanding of individual differences and should provide a 
reference source for those interested in differential psychology.


David P. Campbell. Handbook for the Strong Vocational Interest
Blank. Stanford, Calif.: Stanford University Press, 1971. Pp.
x + 516. $20.00.


At the outset, it should be mentioned that this Handbook by 
Campbell is an important contribution to the literature on the 
SVIB. It provides an account of its development and a compre- 
hensive reference to the currently available technical information. 
The technical information includes the rationale and history of 
the occupational scales, the basic interest scales, and the admin- 
istrative indices. Discussions and data on men's and women’s 
forms are presented separately. 

In the discussion of the occupational scales Campbell presents 
examples of items which have positive and negative weights on 
scales, popular items, intercorrelations between scales, overlap be- 
tween each criterion group and men-in-general scores, and mean 
standard scores for each criterion group on each occupational and 
nonoccupational scale.

The procedures for selecting items scored on each of the basic
interest scales are discussed. Items were selected from the
occupational scales on the basis of item intercorrelations and of
correlations among basic interest scales and occupational scales.
Campbell indicates that the recognition of a need for a system
containing fewer scales than the occupational scales led to the
development of the basic interest scales, but he cautions that they are
intended to be used with the occupational scales.

The nonoccupational scales are presented as examples of the use 
of empirical data to develop additional scales which may increase
the scope of the SVIB. Campbell emphasizes the difficulty of
appropriate use of these scales and the need for continued research
into their effectiveness. The Academic Achievement Scale has shown
moderate effectiveness in predicting academic success. The
Masculinity-Femininity Scale contributes to understanding of the
client, but only well trained counselors should use it because of
the danger of misunderstanding of the meaning of the scale. The
Age-Related Interests, Diversity of Interests, and Managerial
Orientation Scales are included currently to encourage the
collection of more data. The Administrative Indices are designed to
detect procedural errors, such as use of mismatched booklets
and answer sheets, omission of some items, or faking.

The Handbook also includes extensive data on reliability and 
validity. Most of the reliability results are test-retest with intervals
ranging from two weeks to 30 years. Validity data are based on
current studies including analysis of records from a variety of
groups used in longitudinal studies. Additional validation evidence
is based on discrimination between occupational groups. Although
many of these results are available in the literature, most users
will appreciate the convenience of the extensive compilation of
data in a single source. The information on the long term follow-up
studies of several occupational groups supports the usefulness
of the SVIB in counseling, but the need to incorporate information
other than interest profiles in counseling is also stressed. An
interesting subsection presents mean profiles of outstanding men in
a variety of occupations ranging from astronauts and former
governors to former presidents of the American Psychological
Association and winners of the Nobel prize in science.

The direction of future research and development is indicated
in the discussion of analyses designed to measure changes in
interests in society between the 1930’s and the 1960’s, and in the
description of the projected revision of the Men’s SVIB. The revised
form follows the same pattern as earlier editions, but additional
occupational groups have been added and new men-in-general
samples have been developed.

The Handbook should be a standard reference for anyone using
the SVIB in either counseling or research until accumulated
research or changes in the test make a revision necessary. The
tables of intercorrelations and mean standard scores for
occupational groups will be used by many readers, but the explanations
are not dependent upon them. It would be unfortunate if some
counselors using the test avoided the book because the mass of data
makes it appear too statistical for their comprehension.

In a comprehensive book such as this the author must make
some decisions about organization. Campbell uses the organization
of the SVIB, presenting rationale, development, reliability,
and validity of each part of the profile as a unit. Some readers
may wish he had presented all reliability data in one section and 
all validity information in another. However, the book is well 
indexed so that locating information on any topic will present no 
problem. 

The Handbook is the most recent product of the continuing 
program of research and publication to improve and update the 
SVIB and to increase the effectiveness of its use in counseling. 
Those using the SVIB in counseling or research and those interested 
in general developments will appreciate both the book and the 
indication of an intent to continue work on the test. 


Berry Corwin 
Department of Psychology 
East Carolina University 


C. Mitchell Dayton. Design of Educational Experiments. New 
York: McGraw-Hill, 1970. Pp. vii + 441. $10.95. 


What is needed in education and psychology is a design book 
similar to that written by Cox (1958) which would incorporate 
the ideas of Campbell and Stanley (1963) and of Bracht and 
Glass (1968) in emphasizing the planning and validation of
experiments. The major focus of most current “design” books in
education and psychology is on the statistical analysis of
experiments rather than on the design of experiments. Dayton’s text
is not an exception in this sense. However, the author does a
commendable job in providing a “workable compromise between
exhaustive coverage of designs and the practical facts of their
utility.” The discussions concerning uses, limitations, and
interpretations of the various analytical procedures are a welcome
addition to what is found in many other “formulation-based” texts.

There are two specific areas in which this text overshadows some 
of its competitors. One is in the author's treatment of repeated 
measures designs (RMDs). In this chapter particular attention 
is paid to the problem of correlated experimental errors which often 
arises in experiments involving repeated measures. The testing
procedure is clearly indicated; however, it may be disappointing
to some to find that the multivariate analysis alternative is not
covered. The other especially fine chapter is on the analysis of
covariance (ANCOVA). An excellent treatment of the cautions in
application and the limitations and advantages of this technique is
provided. Since Dayton bases his covariance analysis on residual
scores as opposed to adjusted scores, the prerequisite background
in regression analysis is lessened somewhat.

Besides the very readable discussions (as judged by students), 
especially in the chapters on RMDs and ANCOVA, the text has
other favorable features. One is the emphasis on analysis via
orthogonal contrasts; this approach provides good preparation for
more extensive study in data analysis. Another plus is the emphasis
on the structural model and on sources of variation and expected
mean squares to aid in determining appropriate test statistics;
Cornfield and Tukey’s algorithm is presented early in the book.
It is also good to see a text that explicitly states what hypotheses
are being tested by what statistics (errors in a couple of such
statements are pointed out below). The symbolism and notation
appear to be orthodox and consistent throughout.

The book is not without its weaknesses and drawbacks. One 
drawback, from an instructional viewpoint, is the absence of
student exercises; a supplementary workbook would be most helpful.
(The excellent “exemplary applications” at the end of most
chapters may provide a basis for a number of questions for
students, à la Lindquist, 1953.) The text is also lacking in its
treatment of analyses and tests to be conducted subsequent to the
rejection of an omnibus hypothesis in the various ANOVA
situations. Although it may be expected that multiple comparison
procedures be covered prior to a course in which this text might be
used, a more adequate coverage may be desirable (see Games,
1971). The general multiple comparison approach is that of
hypothesis testing rather than interval estimation, which may explain
the author’s use of the Newman-Keuls procedure exclusively in
instances after Chapter 2. Neither the Tukey nor the Dunn
(utilizing Bonferroni inequalities) procedure is discussed, while
Scheffé’s procedure is covered in a single sentence. Another
“follow-up” procedure completely ignored in this text is an index of
explained variance or measure of association, such as η² or ω²,
between a “significant” independent variable and the dependent
variable.

There are also some variations in the treatment of a few 
topics, two of which are not correct. The first error encountered 
by this reviewer is in the author's explanation of the Duncan 
and Newman-Keuls post-hoc procedures (pp. 42-43). The use of 
the df-value associated with each mean rather than that as- 
sociated with the appropriate error mean square is in disagree- 
ment with such writers as Scheffé (1959, p. 73) and Winer (1971, 
p. 186). The tabular values (pp. 406-411) are correct; the user 
need only change the second footnote to the table on page 406, 
and the second step on page 42. Another error involves the state- 
ments of hypotheses that are tested using a nested design. For 
a nested factor B, Dayton's statement of the corresponding hy- 
pothesis implies equality of all of the B-effects or population 
B-means across the levels of another factor or across factor
combinations (pp. 203, 210). This is in contrast to Lindquist (1953,
p. 185) and Scheffé (1959, p. 184) who state that the hypothesis
actually tested is the simultaneous equality of the B-means within
each level of the other factor (or factor combination).

The author’s coverage of two other topics also varies from that 
found in other sources. In the definition of a contrast for samples 
of unequal sizes (p. 47) the n-values enter into the expressions 
as weights. There is no error in this definition although such 
weighting is not typically used (see Glass and Stanley, 1970, p. 
388 or Winer, 1971, p. 171); however, this definition is consistent
with other formulations in the text and is needed for an analysis 
by the orthogonal contrast procedure when unequal but propor- 
tional cell frequencies arise in higher order ANOVA designs. 
Another variation involves Dayton’s estimate of ρxy in ANCOVA
where X is the covariable and Y is the dependent variable. The 
estimate implied by Dayton (p. 318) is the total-group correlation 
coefficient; a more appropriate estimate to be used in a compara- 
tive experiment with different groups of subjects may be a within- 
groups coefficient. 
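
The difference between a total-group and a within-groups estimate can be illustrated numerically. The following is a minimal sketch using made-up scores for two hypothetical groups (not data from Dayton's text); the helper names `pearson_r` and `pooled_within_r` are illustrative assumptions. When the groups differ in mean on both the covariable and the dependent variable, the total-group coefficient absorbs the between-groups covariation and can differ markedly from the pooled within-groups coefficient.

```python
from statistics import mean


def pearson_r(xs, ys):
    """Product-moment correlation computed on all scores at once."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def pooled_within_r(groups):
    """Correlation from sums of squares and products pooled within groups.

    Deviations are taken from each group's own means, so differences
    between group means do not contribute to the coefficient.
    """
    sxy = sxx = syy = 0.0
    for xs, ys in groups:
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
        syy += sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical (covariable X, dependent variable Y) scores for two groups.
group_a = ([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
group_b = ([6, 7, 8, 9, 10], [12, 11, 14, 13, 15])

all_x = group_a[0] + group_b[0]
all_y = group_a[1] + group_b[1]

r_total = pearson_r(all_x, all_y)
r_within = pooled_within_r([group_a, group_b])
print(f"total-group r = {r_total:.3f}, pooled within-groups r = {r_within:.3f}")
```

In this constructed case the two groups have identical within-group relationships, yet the total-group coefficient is larger because the group means differ on both variables.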

Even though this text would most likely be used for a second
or third course in educational statistics, a more thorough discussion
of simple or one-way ANOVA and multiple comparisons such as 
that found in Glass and Stanley (1970) would be a desirable 
prerequisite. If so, a “design” course could then start with Chapter 
3. In courses conducted by this reviewer using this text Chapter 
4 (Incomplete Factorial Designs) was deleted. That these designs 
are rarely used in, at least, educational research may be reflected 
by the lack of any “exemplary applications” cited. A one quarter 
session provides ample time to cover the essentials of Chapters 3
and 5 through 8 with time left over for such topics as planning 
and validation of experiments, components of variance, or some 
aspects of multivariate designs, depending upon the students’ 
background.

The criticisms cited in this review are quite specific and are
judged to be minor relative to the overall strength of Dayton’s
text. His discussions of the various ANOVA designs are very well
written and were greatly appreciated by three classes of education
students (most on the doctoral level). Design of Educational
Experiments, when supplemented by exercises, a few other handouts,
and a limited number of outside reading assignments, has been
found to be very satisfactory for use with the majority of
students enrolled in a research design course.


Max D. Engelhart. Methods of Educational Research. Chicago:
Rand McNally, 1972. Pp. xx + 553. $10.95.


In educational research, as in other disciplines, there are books 
best suited for classroom use as texts and there are books of equal 
merit which are best suited for reference use. There are few books 
which combine scope and depth in a manner which makes them 
equally suitable as texts and reference works. Methods of Edu- 
cational Research is such a book. 

Methods of Educational Research is organized around 17 
chapters which range in content from a short introductory chapter 
on the plan of the text to the final chapter, computer applications 
in research. The chapters between the first and last provide a 
wealth of information on the processes and methods of educational 
research. 

The text makes no assumption as to the level of sophistication 
of the reader, but there is an implicit assumption that the student 
is serious in his desire to learn the intricacies of educational re- 
search. Toward this end, the text provides questions for study and 
discussion and suggestions for further study at the conclusion of 
each chapter. 

The text draws frequently from the professional experience and 
acumen of its author. The chapter on methods of inquiry sets the 
tone for the remainder of the material, leaving little doubt that 
educational research will be treated as an integrated science. Care- 
ful mention is made of the problem of valuation and that edu- 
cational research is not the sole heir to this problem. 

The sequencing of material in successive chapters parallels the 
development and execution of research itself, from data generation 
to data reduction and analysis. All the essential elements are 
included along with illustrations and references. The illustrations
are particularly beneficial, the references both representative and
classic. 

Users of Methods of Educational Research should find the text 
on observation and content analysis a valuable source of informa- 
tion. The references alone should save users many hours of effort 
in the development of a systematic literature. 

In addition to the essentials of educational research, there is 
the lucid development of numerous statistical concepts which 
students and professors tend to find burdensome, if not oppres- 
sive. Most other texts on educational research solve the problem 
by avoiding it. Not so, Methods of Educational Research. This 
text faces the problem and deals with it in a style which is at 
once readable and sufficiently rigorous. Although one-half of the 
text appears to be devoted to elementary descriptive and inferential 
statistics, such is not the case in the usual sense. The greater 
proportion of the chapters on statistics is devoted to a discursive
explanation of statistical concepts in relation to educational re- 
search. Users will find these chapters full of succinct explanations 
of procedures which they will encounter in the ever increasing re- 
search literature. However, this material should not be considered 
a sufficient alternative to the more comprehensive study of sta- 
tistics. The student who masters this text should encounter little 
difficulty in more advanced studies in inferential statistics and 
experimental design. 

Appended to the text are six tables which are essential to the
research worker. Table A is a 25 x 14 display of random digits.
This table is noted as illustrative with a caution against its
repeated use. Table B contains areas and ordinates of the normal
curve; Table C, the t distribution with selected levels of
significance for directional and non-directional tests; Table D,
transformation of r to zᵣ, computed by the author; Table E, an abridged
distribution of χ² values through 15 degrees of freedom and the
.001 level of significance; Table F, an abridged set of values for
the F distribution through 12 and 26 degrees of freedom for the
numerator and denominator. In addition to the tables, there are
answers to practice exercises, with graphic illustrations, and in- 
struction in the calculation of square roots. Many times the re- 
viewer has wished for just these tables in any of several texts 
used in introductory courses on educational research. 

No text is without its limitations. The material on computer 
uses is more interesting than illuminating. The chapters dealing 
with historiography, curriculum research, and research reporting 
are overviews, and as such lack sufficient depth and are not in
consonance with the major portion of the text. The symbols used in
the text are quite readable, but the irregularity of the hand-drawn
lines across the top of X’s and Y’s is disconcerting at times.
The reviewer has had only modest success with students using
frequency distributions for the calculation and understanding of 
central tendency and dispersion, and the use of the scattergram 
for calculating the Pearson product-moment correlation coefficient. 
The opportunities for errors of calculation inherent in these pro- 
cedures have appeared to offset the benefits derived from such 
direct confrontations with data. However, devotees of these
procedures will find the author's argument for their continued use
compelling. 

All things considered, Methods of Educational Research is a 
book which should prove a forceful contender in the arena of 
introductory texts on educational research. Careful readers of this 
text will discover themselves possessing more than a casual knowl- 
edge of the historical development of the educational research 
enterprise.

Perry B. Childers,¹ Coordinator
Research and Evaluation Laboratory 
University of Wisconsin-Milwaukee 


H. J. Eysenck. The IQ Argument. New York: The Library Press,
1971. Pp. iv + 155. $5.95. 


The preface and introduction to this book are convincing state- 
ments of Eysenck's liberal beliefs concerning human equality and 
dignity. He does not want to find evidence for racial differences 
in intelligence, but as a socially responsible scientist he must
examine the existing evidence for such a possibility. His sincerity
is unquestionable, and his approach to the subject seems unusually
objective. The reader is thus prepared at the beginning to find a
sound and critical discussion of the evidence for and against racial
differences in intelligence.

The “Jensenist heresy.” Eysenck begins by presenting Jensen's
analysis of the sources of individual and group
differences in intelligence and their implications for educational
practice. Eysenck clearly supports what he calls "the Jensenist
heresy" whose main points are: (a) "Compensatory education" 
programs have failed; (b) IQ test scores are good predictors of 
educational and professional success; (c) Individual IQ differ- 
ences in the white population are primarily attributable to genetic 
sources; (d) In the United States there are both social class and 
racial group differences in mean IQ scores; (e) These group dif- 
ferences cannot be entirely explained by environmental and/or 
motivational factors; (f) An evaluation of the existing evidence 


1Dr. Childers is one of the leaders of the AERA Special Interest Group 
"Professors of Educational Research" of which Dr. Robert Ingle is Chairman. 
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indicates that there are at least some genetic differences in in- 
telligence between white and black Americans and also between 
Americans of different social classes. In addition, Jensen discus- 
sed two broad categories of mental ability, one of which (“abstract 
reasoning”) would be differentially distributed among various 
social classes and races in the U. S., and the other of which 
(“associative learning”) would show negligible social class and 
racial differences. 

While Jensen’s discussion dealt with group differences in general,
the public has focused on the question of racial differences in 
particular, and Eysenck has thus done likewise. Eysenck's analysis 
consists mainly of the issues raised by Jensen, along with several 
new arguments of his own. 

Racial gene pools. The first major point made concerns racial 
gene pools. An entire chapter is devoted to showing that black 
and white Americans constitute different gene pools. The evidence 
presented for this is convincing and unlikely to be contested.
Unfortunately, Eysenck proceeds to some rather far-fetched
speculation on what sort of a selection process took place when African
slaves were brought to America. He concludes that: 


Thus there is every reason to expect that the particular sub- 
sample of the Negro race which is constituted of American 
Negroes is not an unselected sample of Negroes, but has been 
selected throughout history according to criteria which would 
put the highly intelligent at a disadvantage [in coming to and 
surviving in America]. (p. 42) 


This general conclusion is irresponsibly strong given the highly 
speculative nature of the analysis on which it is based. 

What is intelligence? When Eysenck addresses himself to the 
question “What is intelligence?", his analysis deals almost solely 
with traditional IQ tests. He argues that IQ test items are valid 
instruments for assessing intelligence because they have been
selected objectively and largely on the basis of their predictive power
for educational and professional purposes. He fails to discuss
adequately the fact that the educational and professional situation
he speaks of is a racist-capitalist society that has a vested (and
invested) interest in an educational system and an economic
structure that keep blacks from competing economically with
whites. Eysenck argues that blacks need the kind of mental
ability that IQ tests measure because they “seek the rewards of
our type of civilization (p. 74).” But he doesn’t seem to consider
the fact that blacks have been conditioned to desire those
“rewards,” even to the point of wanting straight hair and light skin!
Furthermore, there is increasing evidence that our educational
system (and thus the requirements for success in it) is not the
only way to industrial and technical advancement (Macciocchi,
1971).

It is generally agreed that IQ tests have little relation to any 
of the existing theories of intellectual development. Both general 
theories of cognitive development (Bruner, 1964, 1966; Hebb, 
1949; Piaget, 1952; Wallon, 1942; Werner, 1948), as well as more 
specific models of cognitive processes in children (Jeffrey, 1968; 
Kagan, Rosman, Day, Albert, and Phillips, 1964; Lewis, 1971; 
Rohwer, 1971; Watson and Ramey, 1969; White, 1965; Zeaman
and House, 1963; to name only a few) make it clear that
traditional IQ tests are very inadequate measures of cognitive
processes as we understand them. Furthermore, even the most
“culture fair” tests may not measure the same abilities in different
cultures (Vernon, 1970). This extremely important consideration
has been discussed in an excellent analysis by Cole and Bruner
(1971). Eysenck’s discussion of the nature of intelligence is thus
severely weakened by focusing almost exclusively on traditional
IQ tests. He does refer approvingly to Jensen’s attempt to go
beyond IQ tests and to analyze more precisely the nature of the
ability differences between races that he hypothesizes. However, 
Jensen’s analysis has been subject to convincing criticism (Bodmer 
and Cavalli-Sforza, 1970; Green and Rohwer, 1971; Rohwer, 
1971). 

Evidence for major genetic determinants of IQ test scores. De- 
spite the weakness of IQ tests as measures of intelligence, there 
are group differences in IQ test performance, and it is legitimate 
to investigate the sources of these test score differences. The major 
message of Eysenck’s book is that there is convincing evidence 
that at least part of the observed IQ difference between Ameri- 
can whites and blacks has genetic sources, which has important 
implications for educational practice. 

Like others who emphasize the genetic determinants of indiv- 
idual differences in intelligence, Eysenck relies heavily on those 
studies comparing the IQ scores of twins reared together and apart, 
of siblings, of cousins, etc. These studies suggest that about 80
per cent of the variability of IQ scores of the white population
in England and the U. S. has genetic sources. Eysenck points out 
that a large heritability of IQ differences within a given group 
(e.g., American whites) says strictly nothing about the source of 
differences within another group (e.g., American blacks) or be- 
tween any two groups. While he states this principle repeatedly, 
Eysenck does not seem to take it seriously when he concludes 
that, “it seems likely that it [heritability of IQ scores in the black 
population] would not differ very greatly from the 80 per cent or 
so quoted for white populations (p. 67).” He uses the high
heritability of IQ test scores for whites as important evidence
suggesting genetic sources of racial differences in intelligence. He
actually gets quite confused on this issue, and even talks about
heritability for individuals: “. . . the figure of 80 per cent heri- 
tability is an average. It does not apply equally to every person 
in the country. For some people environment may play a much 
bigger part than is suggested by this figure; for others it may be 
even less (p. 67).” This is very unsound reasoning: heritability 
is a property of populations, not of individuals . . . nor of traits 
(see Fuller and Thompson, 1960; Hirsch, 1970)! Eysenck also 
states that “the discovery of within-race genetic factors determin- 
ing IQ differences is a necessary, but not a sufficient condition of 
accepting the genetic argument as applied to between-race dif- 
ferences (p. 113).” This is not so. Within-race genetic factors are 
neither necessary nor sufficient; it is perfectly possible to have 
genetically-determined between-race differences but environment- 
ally-determined within-race differences. It is disappointing that 
Eysenck presents such a confused analysis of heritability and 
thereby only confirms Hirsch’s remark that “. . . in the study of 
man a heritability estimate turns out to be a piece of ‘knowledge’ 
that is both deceptive and trivial (Hirsch, 1970, p. 98).” 
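
The statistical point that heritability within a population carries no information about the source of a difference between populations can be sketched with a small simulation. This is an illustration under stated assumptions, not an analysis of any real data: scores are taken to be a simple sum of independent "genetic" and "environmental" components, the two hypothetical populations A and B are given identical genetic distributions, and population B is assigned a uniform environmental shift (`ENV_SHIFT_B`, an invented value).

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

N = 20000                  # simulated individuals per population
GENETIC_SD = 0.8 ** 0.5    # genetic component: variance 0.8
ENV_SD = 0.2 ** 0.5        # individual environment: variance 0.2
ENV_SHIFT_B = -1.0         # hypothetical uniform environmental shift in B


def average(values):
    return sum(values) / len(values)


def variance(values):
    m = average(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def simulate(env_shift):
    """Return (genetic components, observed scores) for one population."""
    genetic = [rng.gauss(0.0, GENETIC_SD) for _ in range(N)]
    scores = [g + rng.gauss(0.0, ENV_SD) + env_shift for g in genetic]
    return genetic, scores


gen_a, score_a = simulate(0.0)
gen_b, score_b = simulate(ENV_SHIFT_B)

# Share of within-population score variance due to the genetic component.
h2_a = variance(gen_a) / variance(score_a)
h2_b = variance(gen_b) / variance(score_b)

# Between-population differences in observed scores and in genetic components.
score_gap = average(score_a) - average(score_b)
genetic_gap = average(gen_a) - average(gen_b)

print(f"within-population genetic share: A = {h2_a:.2f}, B = {h2_b:.2f}")
print(f"mean score gap = {score_gap:.2f}, mean genetic gap = {genetic_gap:.2f}")
```

Under these assumptions both populations show a genetic share of variance near .80, yet the entire difference between their mean scores is environmental by construction.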

Other evidence used by Eysenck to support the importance of 
genetic determinants of intelligence is the observed regression to- 
ward the mean of children’s IQ scores when compared to those 
of their parents. For a group of parents of a given IQ level, the 
mean IQ of all of their children is about halfway between the 
population mean and the parents’ mean. Eysenck argues that such 
regression toward the mean is a characteristic of variables that 
have significant genetic determinants. But this reflects an unfort- 
unately incomplete analysis: regression toward the mean is also 
characteristic of traits significantly (or even solely) influenced by 
environmental causes. The regression phenomenon itself gives
absolutely no indication of the causes of the variable being measured
(see Furby, 1971, for a detailed explanation). 
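
That regression toward the mean arises without any genetic mechanism can be shown with a short simulation. The sketch below is a hypothetical model, not one taken from Eysenck or Furby: each parent and child score is the sum of an environmental component shared within the family and an independent component unique to the person, with no genetic term at all. The cutoff of 1.5 and the equal variances are arbitrary illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

N_FAMILIES = 20000
CUTOFF = 1.5  # arbitrary threshold for selecting high-scoring parents


def average(values):
    values = list(values)
    return sum(values) / len(values)


families = []
for _ in range(N_FAMILIES):
    shared_env = rng.gauss(0.0, 1.0)           # environment common to the family
    parent = shared_env + rng.gauss(0.0, 1.0)  # plus influences unique to the parent
    child = shared_env + rng.gauss(0.0, 1.0)   # plus influences unique to the child
    families.append((parent, child))

population_mean = average(p for p, _ in families)

# Keep only the families whose parent scored well above the population mean.
selected = [(p, c) for p, c in families if p > CUTOFF]
parent_mean = average(p for p, _ in selected)
child_mean = average(c for _, c in selected)

# Fraction of the parents' deviation from the mean retained by their children.
fraction = (child_mean - population_mean) / (parent_mean - population_mean)
print(f"selected parents: mean = {parent_mean:.2f}")
print(f"their children:   mean = {child_mean:.2f}")
print(f"fraction of parental deviation retained = {fraction:.2f}")
```

Because the shared and unique components have equal variance here, the parent-child correlation is about .50 and the children of selected parents fall roughly halfway back toward the population mean, the same pattern that is cited as evidence for genetic determination.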

Evidence against major environmental sources of IQ test scores. 
Eysenck, like Jensen, also interprets regression effects as evidence 
minimizing the importance of SES (and thus of environment) in 
determining IQ, but the same facts could be interpreted to support 
exactly the opposite conclusion. In reality, these facts tell us 
nothing about a possible causal relation between SES and IQ.

Eysenck reviews empirical results of racial differences in IQ for
different age groups from 2 years to adulthood (relying heavily
on Shuey, 1966). A major argument he uses to support the role
of genetic factors in determining racial differences in IQ is the
fact that SES cannot account for all the observed black-white
differences in IQ scores. While it is certainly true that
socio-economic status per se does not explain all of the racial differences,
it does not necessarily follow that those group differences do not
have environmental sources. The use of SES as an explanatory 
variable in psychological research is open to much criticism (see 
Tulkin, in press, for an excellent discussion of this issue). Con- 
trolling for SES certainly does not control all of the relevant stimu- 
lation to which the child is exposed (Bodmer and Cavalli-Sforza, 
1970; Lewis and Wilson, 1971; Pavenstedt, 1965; Tulkin, 1968; 
Wachs, Uzgiris, and Hunt, 1971; Wellman, 1940). Eysenck's 
argument that other American minority groups (Orientals, Jews, 
Indians) score better than blacks despite their low socio-economic 
position is open to the same criticism. 

The second major variable that Eysenck considers in relation
to racial differences is motivation. He argues that low motivation
and/or low self-esteem on the part of blacks cannot account for
racial IQ differences, since (a) offering money, candy, or cigarettes
does not significantly change their test performance, and (b) blacks 
do not score lower on personality tests of self-esteem. But the 
studies used to support these conclusions employ very primitive 
manipulations of motivation and evaluations of self-esteem. White 
psychologists have enough trouble knowing what is motivating for 
whites (White, 1959) let alone knowing what is motivating for 
blacks! Furthermore, the present means we have of assessing self- 
esteem make it difficult to swallow the conclusion that blacks have 
equal or greater sense of personal worth than whites. While the 
evidence on possible motivational sources of group differences in 
IQ scores is equivocal, there are just as many data suggesting
that there are racial group differences in motivation (Battle and 
Rotter, 1963; Coleman, Campbell, Hobson, McPartland, Mood, 
Weinfeld, and York, 1966; Franklin, 1963; Lefcourt and Ladwig, 
1965), and that motivation can affect cognitive functioning (Har- 
ter, in preparation; Lewis and Goldberg, 1969; White, 1959) and 
IQ test performance (Zigler and Butterfield, 1968) as there are 
data to refute the importance of motivation. 

Unfortunately, Eysenck does not investigate the possible effects 
of environmental factors other than SES and motivation. Yet 
both theory and research in cognitive development strongly sug- 
gest the importance of a myriad of other environmental factors 
such as: nutritional status (Scrimshaw and Gordon, 1968), con- 
tingency of maternal reinforcement (Lewis and Goldberg, 1969; 
Yarrow, Rubenstein, and Pedersen, 1971), intensity and variety 
of stimulation (Wachs, Uzgiris, and Hunt, 1971), verbal stimula- 
tion and encouragement (Van Alstyne, 1929; Wachs, Uzgiris, and 
Hunt, 1971; Yarrow, 1963), birth order and family size (Sutton- 
Smith and Rosenberg, 1970), practice in conceptual elaboration 
(Rohwer, 1971), to name only a very few. The fact that SES 
does not account for all of the racial differences is hardly evidence 
that environment does not account for all of the differences.
Eysenck himself states that “The colored and the white children 
now after desegregation brought into contact in the same school 
room have been socialized in different ways, have ‘internalized’ 
different standards and norms of behavior and respond differently 
to the usual sanctions of the school room (p. 151).” He points out 
how this creates almost insurmountable problems in trying to deal 
with blacks and whites in the same classroom situation. But he 
doesn’t follow up the logical implications of this argument for 
trying to deal with blacks and whites in the same testing situation!
Cole and Bruner’s (1971) analysis seriously questions
drawing any inferences about ethnic differences in competence 
(such as intellectual capacity) from performance differences in a 
standard testing situation: 


One must inquire, first, whether a competence is expressed in a 
particular situation and, second, what the significance of that 
situation is for the person’s ability to cope with life in his 
own milieu . . . when we systematically study the situational 
determinants of performance, we are led to conclude that cul- 
tural differences reside more in differences in the situations to 
which different cultural groups apply their skills than to differences
in the skills possessed by the groups in question (Cole
et al., 1971, Ch. 7). (Cole and Bruner, 1971, p. 874)


The one study (Burks, 1928) Eysenck discusses which did at- 
tempt to deal with more than just a global SES measure con- 
cluded that environment accounted for about 20 per cent of the 
observed individual differences in IQ. But this study is 45 years 
old, and it is difficult to rely very heavily on Burks’ inability 
to identify and measure many of the possibly important environ- 
mental variables, examples of which I have mentioned above. 

Eysenck makes the typical plea for more research on racial
differences in intelligence, and he heartily supports Shockley’s 
(1970) proposal for studies comparing the intelligence of subjects 
of mixed blood. The one such study he considers exemplary is de 
Lemos’ (1969) comparison of full-blood and part-blood Aboriginal 
children (ages 8-15) in Australia. On a number of Piaget’s conservation
tasks, she found the part-Aboriginal children to be in
advance of the full-Aboriginal subjects. This is certainly evidence
suggesting genetic sources of racial differences in cognitive
development, since de Lemos claims that the environmental con- 
ditions of all the children were identical. However, this is only 
a single study, and it is dangerous to give it as much weight and
space as Eysenck does. Indeed, a replication by Dasen (1970),
using the same community as de Lemos, failed to find the racial 
differences in conservation performance. Furthermore, contrary
to what Eysenck reports (but not reported by de Lemos in her
1969 article!), Dasen states that the ancestry of each child was 
well known in the community. In addition, the experimenter 
could judge at a better than chance level the ancestry of a child 
just from physical appearance. Still another study (Dasen, de 
Lacey, and Seagrim, 1972) found almost no difference in the devel- 
opment of concrete operations between Australian children of 
European and of Aboriginal origin. While these studies may not 
have been available to Eysenck, they provide an excellent example 
of why the results of a single study must be interpreted with 
extreme caution. Eysenck tends to overemphasize and overgenera- 
lize from single studies. 

How much can we change IQ test scores? Eysenck concludes 
his scientific analysis of the evidence by discussing the implica- 
tions of the high heritability of IQ for efforts to change IQ through 
environmental manipulation. Unfortunately, he forgets his own 
warning that 80 per cent heritability for whites tells us nothing 
about the heritability for blacks nor about the sources of group 
differences. He argues that even if the environment were identical 
for everyone, the shape of the distribution of IQ's would be only 
slightly different from what it is now. But the validity of this 
statement depends entirely on which uniform environment one 
chooses. Furthermore, this fact implies nothing about the effect 
of uniform environment on the distribution of black IQ’s nor on 
black-white differences. While Eysenck calls his position “inter- 
actionist” he actually maintains that there are genetic differences 
in intelligence such that, given a common (and realistic-natural) 
environment, American Negroes are less intelligent than Ameri- 
can whites. But if Eysenck is truly an interactionist, and if there 
are genetic differences in intelligence between blacks and whites, 
then he must entertain the possibility that there are some en- 
vironments in which blacks do better than whites. 

In discussing environmental manipulation of IQ scores, 
Eysenck analyzes at some length the educational implications of 
research on intelligence. While certain scientists would question 
his assumption that compensatory education has failed (Camp- 
bell and Erlebacher, 1970), few would disagree with his position 
that educational practice must take into account the abilities and 
aptitudes of the individuals involved. 

As part of his discussion of “changing human nature,” Eysenck 
makes some interesting points about the characteristics of the IQ 
distribution. Unfortunately his analysis is quite incomplete, and a
comparison of the theoretical and observed curves he presents 
actually suggests that the heritability of IQ among blacks is much 
smaller than for whites. In fact, the observed shape differences 
between the distributions for blacks and whites (which Eysenck 
fails to discuss) are entirely consistent with a model in which 
the environment has additive effects on IQ. According to such a
model, whites might receive an important benefit from the environment
while blacks might receive much less (i.e., group differences
have environmental sources), and yet most of the white variability 
in IQ would be genetic while most of the black variability would 
be environmental (see Furby, 1972, for a detailed exposition of this 
model). While such a model has by no means been proven yet to 
accurately represent the true picture, it is certainly as
logical, reasonable, and likely as any of the analyses Eysenck offers. 

Good intentions + faulty analysis = bad product. It is difficult 
to know for whom this book is intended. It appears to be written 
for popular science consumption (like his Know Your Own IQ), 
in which case he has gone way too far in concluding that, “All 
the evidence to date suggests the strong and indeed overwhelming 
importance of genetic factors in producing the great variety of 
intellectual differences which we observe in our culture, and much 
of the difference observed between certain racial groups (p. 126).” 
The evidence is much too inconclusive for such a statement. If, 
on the other hand, Eysenck’s justification for writing the book is 
sincere, namely to encourage more research in this area, then in 
order for his fellow scientists to give his argument serious con- 
sideration he must supply the specific sources of the empirical 
results from which he draws his conclusions. It is certainly not 
difficult to write for the general public while still respecting the
scientific community’s necessity of knowing exact references (see 
Montague, 1971, for a good example). Without a specific reference 
to consult, it is hard to believe some of his statements, such as: 
“Cederlof, Lundman, Friberg, and their associates have studied
tens of thousands of identical twins one of whom was a smoker
and the other a non-smoker [italics added] (p. 78).”

Eysenck’s stated goals and attitudes are commendable. His intentions
are the very best as revealed in his closing chapter on
the social responsibility of science. But he fails disastrously in his 
analysis and interpretation of the data. There are serious meth- 
odological and logical fallacies in his evaluation of the evidence. 
This weakness is both unexpected and very disappointing since 
the reader is easily convinced by the introduction that it is only 
with much personal pain and in the face of irrefutable evidence 
that Eysenck will accept racial differences in intelligence. He sin- 
cerely thinks he is being objective and only facing the facts; 
actually he is over-interpreting the data and drawing unjustified 
conclusions. This book is remarkably full of statements supporting
important scientific principles of analysis which are then contradicted
by Eysenck in his own interpretation of the data.

There is not much new here that hasn’t been discussed by 
Jensen. But Eysenck analyzes the evidence less rigorously and
both his stated and implied conclusions are more categorical on
the race issue than Jensen’s. While I agree with Eysenck that
researching possible genetic differences in intelligence does not 
necessarily mean one is a racist, I cannot agree with his implica- 
tion that disagreeing with Eysenck’s evaluation of the evidence 
is a sign of “readers with closed minds (p. ii).” 


Warren G. Findley and Miriam M. Bryan. Ability Grouping:
1970 Status, Impact, and Alternatives. Athens: University of
Georgia, 1971. Pp. 94. Free upon request from Dr. Morrill
M. Hall, Director, Center for Educational Improvement, University
of Georgia, Athens, GA 30601.


Lack of administrative and financial support has resulted in 
overlapping and fragmentary efforts in many areas of educational 
research. The Findley and Bryan report is an attempt to bring 
some order to this chaos with respect to research outcomes concerning
the practice of grouping in schools.

The readership of this report should certainly include school 
administrators at the local level, since they make many of the 
decisions—to group or not to group, how to group. Perhaps with 
this readership in mind, the authors begin the report with a 
section of conclusions and recommendations. Findings are sum- 
marized in a logical sequence which yields the conclusion: group- 
ing should not be practiced except for the learning of individual 
subjects and in this case only if “the information gained by test- 
ing and/or observation is the first step in a program of diagnosis 
and individualized instruction.” Further, “provision should be 
made for frequent review of each individual’s status as part of the 
instructional program.”

Four chapters and appendices comprise the remainder of the 
report. The first of these presents results of a survey of grouping 
practices. A wide range of responses was obtained to a nine-item 
questionnaire, and these are analyzed along several dimensions 
—geographic, size of school district, racial composition of student 
bodies, etc. This section is quite readable and should be of value
to school administrators. 

At this point it is necessary to bring up the distinction between 
ability grouping and achievement grouping. The authors apparently
take the position that the second is merely another category
of the first and subsume both under their title, Ability Grouping.

¹ Both Susan Harter and Lee Cronbach made suggestions for significant
improvements to earlier drafts of this review. However, the author
claims all the credit for any remaining inadequacies.

Unfortunately this approach may confuse many readers who
make very clear (if not uniform) distinctions between the two 
categories. The authors acknowledge that the subject matter based 
grouping they recommend under certain circumstances “has some- 
times been referred to as ‘achievement grouping.’” However, this 
clarification is “in the fine print” and may not overcome the effect 
from the title of the report. It is even conceivable that some 
schools might use this report as justification for undesirable grouping
practices based on subject matter tests, on the grounds that
these do not measure “ability.” In the survey referred to above, 
no distinctions were suggested for categorizing the bases for grouping,
so that responses cover all grouping practices whatever their
origin.

The second chapter reviews an extensive cross-section of re- 
search literature on effects of grouping, particularly with respect 
to generally less privileged minority groups. It is well organized 
and should be understandable to most professional educators. It 
is in this section that the major arguments against most grouping 
practices are developed.

The third chapter concerns the tests commonly used to effect 
grouping. The main concerns are bias and validity. A seeming
paradox is adroitly brought forth: While most tests are clearly
biased in favor of middle-class, white culture, their predictive
efficiency is frequently greater in the case of students of other 
cultures. The resolution is clear: The higher validity follows from 
the white, middle-class curricula with which minority children 
must cope. Those minority children who by good fortune can cope
with the typical curriculum do so much better than the others of 
their group that high validity coefficients result. This finding is 
not new, but its implications for the need to alter curricula often seem
to be overlooked by school personnel. This chapter involves a good 
many measurement technicalities and may not be accessible to 
some of the readers to whom the report is directed. An attempt 
is made to define terms and orient the reader, but in the reviewer's 
opinion the effort falls far short considering the complexity of the 
chapter. For example, validity is never defined in terms of cor- 
relation, though validity coefficients are discussed quite freely 
throughout the chapter. 
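
For readers who want the missing definition made concrete, the following is a minimal sketch, not taken from the report under review: a validity coefficient is conventionally the product-moment correlation between test scores and a criterion measure. The function name and the five pairs of scores are hypothetical and serve only to illustrate the computation.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations about the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: a grouping test and a later achievement criterion
test_scores = [1, 2, 3, 4, 5]
criterion = [2, 4, 5, 4, 5]

validity = pearson_r(test_scores, criterion)
print(f"validity coefficient = {validity:.3f}")
```

A coefficient computed separately within each group of students is what the chapter's comparison of "predictive efficiency" across cultures refers to.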

The last chapter presents various alternative strategies to the 
grouping practices that are (we hope) to be done away with. This 
discussion will not tell everything a school administrator needs
to know in this area, nor is it intended to. Nevertheless, it does 
describe viable alternatives to the seemingly inevitable clamor 
for grouping and gives references for further investigation. 

An appendix discusses the contentions of Arthur R. Jensen 
concerning racial intelligence differences. These might be interpreted
as undermining the authors’ earlier conclusions to the effect
that most grouping practices are unnecessary and undesirable. An 
excellent argument is presented refuting those of Jensen's conclusions 
which might suggest use of observed ability test score differences 
for grouping. 

Robert B. Frary
Virginia Polytechnic Institute and
State University
Blacksburg, Virginia


Donald W. Fiske. Measuring the Concepts of Personality. Chicago: 
Aldine Publishing Company, 1971. Pp. xviii + 322. $9.95. 


Fiske set out to present a view of measurement which is larger 
than a compendium of psychometric techniques. The result is a
definitive work. Fiske, an expert and major contributor in the 
area of personality assessment, articulates the Gestalt as well as 
the details of his view of the field by dint of highly organized 
thought and a gift for uncluttered language. 

The author is concerned with personality measurement in the 
context of basic research; he has carefully disavowed facile gen- 
eralizations of the discussion to individual assessment for person- 
nel decisions or clinical diagnoses. Measuring the Concepts of Per- 
sonality is not a practitioners’ handbook. Both Cronbach's 
Essentials of Psychological Testing and Kleinmuntz's Personality 
Measurement summarize testing procedures and instruments which 
are currently in use; Fiske does not compete with their efforts.
Although the basic researcher must be familiar with instruments
which have been developed so that his study builds on previous 
findings, Fiske emphasizes the problems of measuring aspects of 
personality which have not before been successfully identified or 
quantified. 

The book has two parts; in the first, the context of measuring 
is organized by outlining the nature of personality, the isolation 
of particular variables and the consequences of observation. The 
second part of the book is about measuring. Quite likely the 
author chose “concepts” as the object in the title of the book
rather than the term “constructs,” which has a more technical and
precise connotation in psychology. The more literary “concepts of
personality” accurately reflects the level of specificity of most
theories in personology. The point of the book, and the feature
which sets it apart from other efforts in the field, is the connection
drawn between theory and experimentation. The need for
interaction between the two is a recurrent theme; Fiske not only
persuades us that personality theorists and psychometricians need
to work together, he shows us how. Fiske narrates the process a
scientist would follow from an idea, to the specification of a construct
(mediated by an understanding of the processes implied
in personality), to the choice of an appropriate mode of measuring,
and finally to a validation of the construct as well as the instrument.
Fiske does not tell one “how to write instruments”; he is
concerned with more fundamental issues with greater scientific im- 
port. 

The book may be called definitive because it leaves nothing 
out. The virtue of the author’s style is that each technical aspect 
of measurement theory is presented where it fits most logically, 
where it is best seen in relation to larger issues. In Chapter 10, 
“A Human Being Takes a Test,” the author combines several
issues which are frequently omitted from other texts and rarely 
interrelated: the impact of the testing situation on the subject’s 
emotional state and the potential for differential responses, the 
conditions which increase the probability that response sets will 
operate, and the ethical constraints to be considered when testing 
human subjects. Fiske’s view of personality assessment provides a 
superstructure within which finer subtopics can be meaningfully 
presented. For example, he proposes to classify different ways of 
measuring into one of six modes; as part of the discussion about 
the potential reactive nature of modes with subjects as data pro- 
ducers, he covers even so small a detail as the best means of 
administering experimental instruments so as to increase rapport 
and minimize reactive effects. 

Measuring the Concepts of Personality is a remarkably readable 
book considering the sophistication of the thinking. Fiske uses nontechnical
language and avoids vocabulary with which only assessment
methodologists are familiar. The most technical
chapter about deriving “indices,” the scores or quantified repre- 
sentations of the data, can be understood by the reader with only 
a basic understanding of elementary correlation theory and the 
concept of variance. In this one chapter, Fiske summarizes all of 
the aspects of reliability and validity as they pertain to the devel- 
opment of accurate measuring devices. For example, he presents 
the formula for Cronbach’s alpha for the purpose of making the 
following point, “In interpreting values obtained by this formula, 
we usually look carefully at the variance of the person scores, 
V;. Not only do we want our test to spread out subjects so that 
they are differentiated as much as possible, but it is also known 
that the obtained reliability is a function of this variance, as
the formula shows.” (p. 153) Fiske avoids mathematical deriva- 
tions and explanations; as a consequence less technical expertise 
is required of the reader and the major measurement problems 
which concern the researcher are discussed without interruption. 
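
The dependence of reliability on score variance that Fiske points to can be seen in a short computation. The sketch below uses the standard textbook form of coefficient alpha, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores); it is not reproduced from Fiske's page 153, and the function names and the small score matrix are hypothetical.

```python
def sample_variance(values):
    """Unbiased sample variance (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def cronbach_alpha(scores):
    """Coefficient alpha for a persons-by-items matrix of scores."""
    k = len(scores[0])  # number of items
    item_variances = [
        sample_variance([row[j] for row in scores]) for j in range(k)
    ]
    # Variance of the persons' total scores; alpha grows as this
    # variance grows relative to the summed item variances.
    total_variance = sample_variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical data: five persons (rows) answering three items (columns)
scores = [
    [2, 3, 3],
    [4, 4, 5],
    [1, 2, 2],
    [5, 5, 4],
    [3, 3, 4],
]

alpha = cronbach_alpha(scores)
print(f"alpha = {alpha:.3f}")
```

Because the variance of the total scores sits in the denominator, a test that spreads subjects out more widely yields a larger alpha for the same item variances, which is the point of the quoted passage.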

Measuring the Concepts of Personality can be highly recommended
to research psychologists and graduate students in psychology,
especially those among the latter who have avoided taking
assessment courses because of their highly mathematical or
statistical content. The book is not, however, an ideal text for a 
basic course in personality measurement. One would lose the con- 
ceptual unity of the book if it were segmented for weekly dis- 
cussions. The book would be put to its best use if graduate stu- 
dents in psychology read it during the first week of a semester 
course and subsequently applied its principles to specifying their 
own construct, devising test instruments, and carrying out some 
validation procedures. The book is also recommended reading for 
a researcher in education, business, and other applied behavioral 
sciences who is about to embark on a study requiring an assess- 
ment of personality variables. 

The most outstanding parts of the book are Chapter Six, “Specification
of Constructs,” and the treatment of construct validation.
Both of these parts are important because in them Fiske demon- 
strates how conceptual formulations and empirical work are to be 
coordinated. In addition, both discussions illustrate Fiske’s ability 
to see solutions which are at once congruent with ultimate theo- 
retical goals and in keeping with constraints imposed by the 
current state of the art. The best example is Fiske’s recommendation
that the method-specific variance which confounds our assessment
of the “true” construct be taken into account by the
specification of a subconstruct for each mode of measuring. Al- 
though such a procedure is not ideal, it is the best means available 
to solve the problem that the perspective of the data producer
(usually associated with different ways of measuring) contributes 
some reliable portion of the variance of the resulting index. By 
establishing the subconstructs which are the objects of our mea- 
surement, we acknowledge that the true construct is not being as- 
sessed and can make more accurate predictions about the relation- 
ship of our results to the results of other measures. The discussion 
of construct validation is equally pragmatic. Although the 
ideal situation would be the independent existence of a con- 
struct against which instruments were compared for their validity, 
the more realistic approach, given our present skills in persono- 
logy, is to consider a simultaneous validation of constructs and 
instruments allowing either tests or theories to be improved by 
empirical findings. 

One minor complaint must be recorded. The author used the 
Rorschach Inkblot Test and the Thematic Apperception Test 
as examples more frequently than any other instruments. It is 
acknowledged that the entire book is concerned with basic re- 
search and not with individual diagnosis and that, therefore, even 
relatively low positive correlations with a criterion may be of
interest. However, these limitations on the use of the two tests
are not specified; the sophisticated reader knows of the lack of 
empirical evidence that judges can agree in scoring the Rorschach 
and the TAT and that the obtained profiles do not correlate well 
with other measures of personality (Adcock, 1965; Jensen, 1965;
Kleinmuntz, 1967). The novice, who has so much to learn from 
this text, may be misled. The uninitiated reader is likely to gen- 
eralize from Fiske’s favoritism to uses outside of the research 
setting. Surprisingly Fiske used these examples even though they 
do not meet his criteria for good instrumentation; although the 
task specified for the subject is ambiguous and standardized for all 
subjects, neither the TAT nor the Rorschach can be considered 
homogeneous measures of a specified construct. 

Fiske’s book will be read for two reasons: it is a comprehensive, 
well-integrated view of personality assessment, and it requires 
little prior experience in measurement theory to comprehend it. 
Although novice students will appreciate Fiske’s ability to avoid 
sophisticated language, they are not the book’s only audience. The
author’s work is so well done that his colleagues can learn from 
him. 


REFERENCES 


Adcock, C. J. Review of the Thematic Apperception Test. In
O. K. Buros (Ed.). The Sixth Mental Measurements Yearbook.
Highland Park, N. J.: Gryphon Press, 1965, pp. 538-535.

Cronbach, L. J. Essentials of psychological testing (Third Edition).
New York: Harper and Row, 1970. 

Jensen, A. R. Review of the Rorschach. In O. K. Buros (Ed.). 
The Sixth Mental Measurements Yearbook. Highland Park, 
N. J.: Gryphon Press, 1965, pp. 501-509. 

Kleinmuntz, B. Personality measurement. Homewood, Ill.: The 
Dorsey Press, 1967. 

Loretta A. SHEPARD 
Laboratory of Educational Research 
University of Colorado 


M. Clemens Johnson. Educational Uses of the Computer: An Introduction.
Chicago, Illinois: Rand McNally and Company,
1971. Pp. xv + 239. $4.50, P/B.

Given the increasing use of computers in education, coupled 
with the shortage of quality books in this area, we undertook the 
task of reviewing Educational Uses of the Computer: An Intro- 
duction with some interest. According to the author, the book was 
written “to provide a concise and nontechnical introduction to a
variety of uses of the computer in education.” It is aimed at a
very large audience; namely, beginning students in education and 
other social science fields. In addition to serving as a textbook
for these student populations, the author suggests that the book 
would be useful as a supplemental text in many education and 
social science courses, and in introductory courses in research, 
statistics, and data-processing. 

The book consists of eight chapters discussing three general 
topics. The first chapter covers some basic information about com- 
puters, such as history, characteristics, programming, and lan- 
guages. Chapters 2 through 5 introduce the applications of the 
computer to education. Specific applications, such as educational 
data-processing, computer-assisted instruction, and computer 
guidance models, are discussed briefly. The final three chapters 
are concerned with the use of computers in educational research. 
Topics discussed include data coding, elementary statistics, hypothesis
testing, and computer simulation. Also included is a helpful
glossary of terms commonly encountered in connection with
computers. 

While we heartily support the general purpose of this book, we 
feel that it has many shortcomings which reduce its usefulness. 

To begin with, it is our contention that the writer has presented 
a series of topics which, although related by a general underlying 
theme, is too broad, and for which the treatment is too shallow 
to provide any one group of students from the intended audiences 
with enough information to make reading the book worthwhile. 
It is our opinion that Johnson would have produced a better book 
if he had restricted the range of his target population, and pre- 
sented more detailed information about fewer applications of the 
computer, since the briefness of some parts resulted in too many 
incomplete and confusing sections. A case in point is the few pages
on computer programming. This section might well have been left 
to introductory books such as Anderson’s Computer Programming 
FORTRAN IV. In place of this discussion, for example, the 
writer could have expanded his very brief reference to the BMD 
and SPSS packages of statistical programs, and mentioned other 
canned programs which are quite important to a large class of 
computer users. 

Another example is the treatment of statistics in the section on 
research uses of computers. The material is presented at such a
superficial level that it is of little or no use to any reader. How
can one discuss descriptive statistics and hypothesis testing
in five pages! Formulas are poorly presented, not all terms are
defined, assumptions underlying the use of certain statistics are
not always stated, and generally the material is presented in
what seems to the reviewers a disorganized fashion. In this same
section, the author misses a good opportunity to point out
one of the important uses of the computer in research when he
says that “sampling distributions are mathematically rather than
empirically derived.” The fact is that computers are now being 
used to simulate sampling distributions in cases where mathe- 
matical solutions are difficult, if not impossible, to obtain. 
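
As an illustration of this use, the sketch below simulates a sampling distribution empirically rather than deriving it. It is a minimal example under stated assumptions: the population is taken to be standard normal, the sample size and number of replications are arbitrary, and the helper name is hypothetical. The sample median is included because its exact sampling distribution is less convenient to derive than that of the mean.

```python
import random
import statistics

def simulate_sampling_distribution(statistic, n, reps, seed=0):
    """Draw `reps` samples of size `n` from a standard normal population
    and return the value of `statistic` computed on each sample."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        values.append(statistic(sample))
    return values

# Empirical sampling distributions of the mean and the median, n = 25
means = simulate_sampling_distribution(statistics.mean, n=25, reps=2000)
medians = simulate_sampling_distribution(statistics.median, n=25, reps=2000)

# The standard deviation of each simulated distribution estimates the
# standard error of the corresponding statistic.
se_mean = statistics.stdev(means)
se_median = statistics.stdev(medians)
print(f"simulated SE of mean:   {se_mean:.3f}")
print(f"simulated SE of median: {se_median:.3f}")
```

For the mean, the simulated standard error can be checked against the known value of sigma divided by the square root of n (here 1/5); no such simple check is needed for the simulation to be useful with statistics whose distributions are hard to obtain mathematically.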

Another problem with the book is that we find some of the in- 
formation presented is either misleading, trivial, or irrelevant to 
the author’s stated purpose. Instances of misleading information 
abound. For example, it is just not true, as Johnson suggests, 
that it takes one to two weeks to learn to operate a keypunch 
machine. Most keypunch users master it within a couple of hours, 
although it is true that speed and accuracy skills take time to 
develop. 


Trivial information such as, “A (computer) center which operates
three shifts around the clock will require a somewhat larger
staff than one which ceases operation at night” appears 
throughout the book and will undoubtedly prove annoying to the 
reader. 

Irrelevant filler clutters up the book. For example, while describing
instructional uses of the computer, the author includes a
description of grading and credit procedures used in computer 
courses. How this information expands the reader’s knowledge of 
uses of the computer remains unclear. The most flagrant example 
is the inclusion of one full chapter (161 pages) on research and 
development projects. Included in the chapter is information on 
topics such as proposal writing, program planning, and educational 
resources information centers. Little if anything would have been 
lost if the entire chapter had been omitted. Perhaps then the treat- 
ment of such topics as computer managed instruction, which is 
fast becoming one of the most important areas of investigation in 
educational technology, could have been expanded. 

In summary, it is difficult to recommend this book with all of the 
associated problems we feel that it has. However, perhaps for that 
group interested in learning specifically about the material covered 
in Chapters 2 through 5, the book will be adequate. On the other 
hand, the material covering the use of computers in research is 
quite unacceptable, for many of the reasons given above. For 
students with some quantitative background, the book edited by 
Holtzman (1971) will likely prove more satisfying, although a 
few of the contributions are quite technical. Thus it is apparent 
that our search must continue for that definitive computer uses in 
education book that is nontechnical, written in a clear, concise, 
comprehensive manner, and yet capturing the sense of excite- 
ment that most of us have when we work with the computer. 
John Jung. The Experimenter’s Dilemma. New York: Harper &
Row, 1971. pp. X + 306. $4.95 (paperback). 


Essentially this book is a collection of readings with an 81 
page introduction. For the most part this review will deal with 
that introduction. The central problem that Jung addresses 
himself to is the effect of experimenter bias on psychological ex- 
periments. The very process of hypothesis formulation and the 
manipulation of conditions subjects are exposed to sensitizes an 
experimenter to the expected results. While no reputable psycho- 
logist would deliberately bias his results, there is considerable 
evidence that human (and some infra-human) subjects may re- 
spond to stimuli emitted unintentionally by the experimenter. 

The use of human subjects poses several problems for research 
psychologists. It is felt by several authors that most, if not all, 
subjects will interact with the conditions of the experiment. The 
subject will use his perceptions of the experimental conditions to 
put himself in the best light, to do well, to be cooperative, to 
please the experimenter, etc. A major biasing effect in the study of
human behavior is that an overwhelming proportion of the sub-
jects used are college sophomores enrolled in introductory psy-
chology courses. These subjects are not even representative of all
college students, much less the total population, and therefore re-
present much biasing of experimental results. Many social psy-
chological studies use deception in order that the subjects will as-
sumably not realize the true nature of the experiment. However,
there is some evidence that deception is not effective. Apparently
in some experiments subjects go along with some rather ridiculous
situations which would not stand up under even the most super-
ficial analysis by the subjects. Part of this phenomenon may be
attributed to the prestige and social responsibility attributed 
to the experimenter. 

In the use of human subjects the experimenter has a moral 
and ethical obligation to insure that no harm will come to his 
subjects. In fact there is much evidence that subjects are at least
covertly aware of these obligations and do not worry for their
own safety nor the safety of other subjects. In social psychologi-
cal experiments quite often the variable(s) under study relate to the
subject’s self-perception. If the experimenter is completely honest
he will be unable to gather the type of data necessary for his study;
however, if he is deceptive he may adversely affect his subjects.
The experimenter must evaluate the effects of deception on his
subjects and determine if the data are worth that effect. 

Probably the most discouraging aspect of this work lies in the 
very last section of Jung’s introduction in which he is discussing 
trends. His last reference is dated 1953. If this is the date of 
the most enlightened statement on the problems of experimental 
ethics, has the field actually been moving forward? 

The articles included in the readings quote each other quite 
frequently. In all honesty if one were to select and read two or 
three of the selected readings he would have been exposed to most 
of the ideas in the entire book. 

The book or some of the articles from it should be required 
parallel readings in courses which include topics concerning the
manipulation of human subjects. While the book is directed to-
ward social psychology the problems and cautions are relevant
to studies in education. The only major translation needed is the
substitution of the words teacher and students for experimenter
and subjects, respectively. The problems of experimenter bias 
and subject deception are very evident in many educational 
studies. 


Cart Nem SHAW 
University of Houston 


Richard I. Lanyon and Leonard D. Goodstein. Personality As-
sessment. New York: Wiley, 1971. Pp. vii + 267. $8.95

Leonard D. Goodstein, and Richard I. Lanyon (Eds.) Readings 
in Personality Assessment. New York: Wiley, 1971. Pp. xiii 
+ 792. $11.95 


Anyone who thinks about reading or writing a book in per- 
sonality assessment quickly confronts the dilemma symbolized 
in Cronbach and Gleser's (1957, 1965) bandwidth-fidelity analogy.
That is, the wider the scope, the more “wave lengths” upon which
the authors attempt to responsibly resonate, the lower the like-
lihood of fidelity or thoroughness with which the material can be
communicated and integrated by all but the most talented of
readers and writers. The authors of Personality Assessment have
resolved the problem in a manner which should not prove startl-
ing to many readers. Because the questions inherent in personality
description and evaluation are tenaciously rooted, the list of
topics explored by the authors is inevitably similar in many
respects to those in other texts addressed to the same area (e.g.,
Kleinmuntz, B. Personality Measurement, An Introduction. Home-
wood, Ill., Dorsey Press, 1967). The organization and style of pre-
sentation will distinguish Personality Assessment from other
books rather than the table of contents. 
In choosing to write an issues-oriented book, the authors will
disappoint readers who are looking for a comprehensive review 
and evaluation of assessment instruments and technology. Also, 
those who are familiar with the authors’ clinical experience and 
acumen may be unhappy with the absence of instruction in the 
applied arts of how to sensitively accumulate and meaningfully 
synthesize personality assessment data. However, for those who 
want a crisp, timely orientation to the theory and research in 
personality assessment, Lanyon and Goodstein should meet their 
expectations. To criticize Personality Assessment for what it 
fails to include is akin to commenting upon the ways in which 
a horse is an inadequate example of an elephant. 

Lanyon and Goodstein commence with a review of the histor- 
ical precursors of systematic personality assessment—astrology, 
palmistry, and phrenology. For the beginning student who expects 
to be confronted with a mass of intimidating technical terms or a
treatise on the glorious role of personality assessment in the
scheme of the world, the initial chapter is a pleasant and useful
way to slip gently into the troubled waters. One teacher who is
using Personality Assessment with first year graduate students
reports that the book is “self-steering” in the sense that the mat-
erial flows smoothly and leads the students to the issues in a
logical sequence without requiring frequent consultation with the 
instructor. 

Next, the authors examine the fundamental logic of personality 
appraisal, establishing the assumptions underlying the various ap-
proaches to assessment. For example, the rational-theoretical ap-
proach, which begins with a preconceived plan of how personality
might (or ought) to be organized, is compared with the empirical 
approach which depends upon the statistical properties of assess- 
ment data patterns. Of particular value is a thoughtful discussion 
of the problems in establishing firm, standard definitions and 
concepts in personality assessment when personality theorists can- 
not agree among themselves upon a definition of personality. 

In succeeding chapters the authors expand upon current ap-
praisal techniques and instruments as these derive from the
rational-theoretical and empirical-statistical approaches. Examples
of tests which illustrate the advantages and disadvantages of each
approach are discussed. But no attempt is made to classify or
discuss all types of personality related appraisal instruments.
For example, the authors do not provide commentary upon the
[...] for personality assessment of vocational interest sur-
veys. But, for those readers who are becoming skeptical of the
continuing usefulness of either inkblots or true-false question-
naires, Lanyon and Goodstein devote an entire chapter, and some
unmistakable enthusiasm, to behavior sampling in natural set-
tings, the unobtrusive observational methods, and biographic data
collection. For the student or researcher who wants a compact
review of both the research and arguments surrounding the non-
standard and improvisational personality assessment techniques,
this chapter could be especially meaningful. 

The now traditional topics of reliability, validity, base rates, 
and clinical vs actuarial problems along with the political, ethical, 
and moral issues in personality assessment are soberly 
and conscientiously treated in several chapters. Congruent with 
the aim of the authors, the statistical and technical terminology 
is held to a minimum both in Personality Assessment and the 
accompanying Readings book. Advanced undergraduates and mo- 
tivated laymen should be able to appreciate the nature of the 
problems without getting the message entangled with the medium. 
For example, in the Readings book section on reliability and 
validity, the most technical selection is an article by Rimm (1963) 
dealing with cost efficiency and production which should not annoy 
any but the staunchest foes of formulae. 

The problems of response styles, response sets, acquiescence, 
social desirability, defensiveness, faking, and the like are dealt
with in a separate chapter. The authors in general avoid taking
sides on controversies throughout the book; but, in their chapter
on “response distortions,” Lanyon and Goodstein come close to
offering a critical review with judgments and conclusions. Per-
sonality Assessment comes to a close with an examination of
the impacts upon personality assessment from computer technology, 
operations research, and other management decision-making schema. 

Personality Assessment succeeds in the stated aim of the authors 
to provide a general introduction to personality assessment, and 
to lay the groundwork for clinical and research applications. Also, 
persons in fields allied to psychology and education seeking a rel- 
atively succinct summary and review of the current state of 
affairs in personality assessment should find their needs well met. 
Any practicing clinician who anticipates being cross-examined on 
personality findings in a courtroom trial might hope that the op- 
posing attorneys have not read Lanyon and Goodstein. 

The Readings in Personality Assessment volume edited by Good-
stein and Lanyon is keyed to coordinate with Personality As-
sessment. The degree to which the articles successfully amplify
the main thrust of the parent book appears to be at least ade-
quate. The most commendable feature of the Readings book is
the high density of contemporary material. Of the fifty-four se- 
lections, twenty are of 1966-1971 vintage. While a number of 
the articles are necessarily from sources familiar and convenient 
to the professional personality assessor, there are a few selections 
from such publications as the Journal of Criminal Law, Criminol-
ogy, and Police Science which may not be at everyone’s finger-
tips. There is also one paper written especially for the book by
Loren and Jean Chapman summarizing their research on beliefs 
of clinicians and “psychodiagnostic folklore.” 

Unlike Personality Assessment which is bright and attractive 
to the eye, and easily entered through the index for selective read- 
ing, Readings in Personality Assessment suffers from a subdued 
type face which flutters late at night and a design shortcoming in 
that the articles follow immediately upon the heels of the pre- 
ceding selection along with an absence of identifying signposts at 
the top of the page to guide the reader who is hunting for an 
article. If one opens the book a page too soon or too late, there
is no help but to return to the index. On the other hand, any 
publisher able to deliver a hardback book of nearly 800 pages in 
length for under $12.00 in these days is perhaps entitled to cut a 
few corners. 


REFERENCE 
Cronbach, L. J., and Gleser, G. C., Psychological Tests and Per- 
sonnel Decisions. Urbana, Ill: University of Illinois Press, 
1957. (Second edition, 1965). 
DAVID A. HILLS
Wake Forest University 


Frank Restle. Mathematical Models in Psychology: An Introduc-
tion. Middlesex, England: Penguin Press Ltd., 1971. Pp. 158.
$2.45 (paperback).


There are four chapters, in addition to the Introduction, in this 
compact paperback. They are titled Concept Identification, Adap-
tation Levels in Perception, Probabilistic Choice Theory and Es-
timation of Parameters. It was the intention of the author “to
introduce the flavor, methods and techniques of mathematical
psychology through the intensive development of a few examples.”

In these four chapters the book does just that in a way that can
be best described by the word “succinctly.” The concepts in each
chapter were plainly and parsimoniously presented with no great
loss in substance. The choice (from among many in mathematical
psychology) of each chapter’s content probably allowed the pre-
sentation and discussion of ideas to be as simple as possible, thus
avoiding the quagmire of complex models and subsequent com-
plicated parameter estimation.

The prerequisites for the book are not clearly spelled-out, but 
I think this would be difficult, if not impossible, to do. The book 
is a true “introduction” only in the sense that a beginning grad-
uate student in the behavioral sciences, with some calculus back-
ground, could understand the content and techniques with occasional
guidance from someone with a strong mathematical statistics back-
ground.

Chapter 4, Estimation of Parameters (labeled a “statistical” 
chapter), is the best of the four chapters for clear presentation 
of difficult material. I get the feeling that the entire book could 
have been contained under the heading of Chapter four with the 
other three taken as example cases. This, of course, probably 
reflects this reviewer’s bias toward the value and techniques of 
parameter estimation and should not be viewed as a criticism. 

There is, however, one point of criticism. In Chapter 4 no men-
tion was made of comparing alternate models to decide on a
“best” model, yet in Chapter 3 a comparison of the Luce and 
Restle models was made via a goodness-of-fit test. The reader 
could justifiably ask, “Why make a comparison of two models in the 
psychological chapters and then fail to present techniques for 
model comparisons in the statistical chapter?” Some confusion
could result on the part of the naive reader concerning “good-
ness” of parameter estimates and “goodness” of models. I would
have preferred no model comparison per se since goodness-of-fit 
tests for models are weak at best and typically result in a state- 
ment to the effect that a model was “better” because the researcher 
failed to show it was “worse” than some other model. The use of 
the two levels .025 and .005 on page 106 with the implication that 
α had been set equal to these p values is a poor research practice
resulting in a drastic reduction in power of the statistical test 
(see Brewer, 1971). 
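
The loss of power that follows from a more stringent significance level can be sketched numerically. The example below is an illustration only and does not reproduce the goodness-of-fit test discussed above. It assumes a one-sided z test with known variance, a hypothetical standardized effect size of 0.5, and a hypothetical sample size of 20; the function name is likewise invented for the example.

```python
from statistics import NormalDist


def z_test_power(effect_size, n, alpha):
    """Power of a one-sided z test (known variance) for a true
    standardized mean difference `effect_size` with `n` observations."""
    standard_normal = NormalDist()
    critical_value = standard_normal.inv_cdf(1 - alpha)
    # Under the alternative the test statistic is centered at effect_size * sqrt(n).
    shift = effect_size * n ** 0.5
    return 1 - standard_normal.cdf(critical_value - shift)


# Hypothetical effect size and sample size, evaluated at three alpha levels.
powers = {alpha: z_test_power(0.5, 20, alpha) for alpha in (0.05, 0.025, 0.005)}
for alpha, power in powers.items():
    print(f"alpha = {alpha:<5}  power = {power:.3f}")
```

Under these assumed values, power falls steadily as the significance level is tightened from .05 to .005, which is the direction of the effect the criticism refers to.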

Ideally, the entire book should be read at a single sitting and 
under no circumstances should the book be considered as anything 
other than a brief familiarization with mathematical models in 
psychology. 


REFERENCE 
Brewer, J. K. On the power of statistical tests. American Edu- 
cational Research Journal. 1971, (In Press) 
Rodney W. Skager and Carl Weinberg. Fundamentals of Educa-
tional Research: An Introductory Approach. Glenview, Ill.:
Scott, Foresman and Company, 1971. Pp. 195. (paperback).


This little book was intended by its authors “to introduce to 
the beginning student in education the methodology of using re- 
search findings, and to give to the teacher the tools to contribute to 
the body of educational research.” (Preface). Regarding the first
goal, the authors have drawn together a surprisingly broad array
of ideas as to what comprises the domain of educational research.
The book is written to provide information about the cultural and
philosophical background of education and educational research;
about the availability and accessibility of sources and resources;
about the nature of the learning process and instructional objec-
tives; about different types of research, such as documentary, histor-
ical and experimental; and about various alternative viewpoints
as to the roles and goals of educational research as perceived by 
teachers, administrators and the researchers themselves. 

A major aim is to persuade the student in education that research 
can, and ought to, have a large role in the entire process of educa- 
tional decisions. The authors may be characterized as holding 
an optimistic view indeed about what applied research can accom- 
plish; their goals in promoting research are likely to appeal to 
many people who claim membership in the community of research- 
ers, but judgments as to whether such optimism is realistic would 
appear to have less consensus. 

In their attempt “to give to the teacher the [tools] of educational 
research,” the authors have touched on nearly all bases. Discus-
sions are to be found on the major varieties of variables in re-
search, such as independent, dependent and moderator; on the
relationships among different types of measurement scales; on some
of the major types of research designs; on ways to collect data; on
the role of randomization in experimentation; on the basic uses
and limitations of inferential statistical methods; as well as on a
variety of concepts in measurement such as reliability, validity and
the kinds of tests, scales and questionnaires which have been devel-
oped. “It is not within the scope of this book to deal with the how of
data analysis, e.g., how a standard deviation is computed” (p. 90). 

In the main, Skager and Weinberg can be judged as having 
achieved their second objective, except that highly compulsive in- 
structors will want to qualify some of their discussions, and even 
to correct some statements (e.g., that “Inferential statistical meth-
ods are always applied for the purpose of determining the proba-
bility that the results observed are due to accidental factors such
as the assignment of subjects” [p. 97]).

It is not surprising in this age to find text-writers using their
educational research books as podiums from which to lecture about
their hopes for educational research. That these authors have
chosen so strenuously to argue for educational change, and for
educational research as a means for achieving such “con-
trolled change”, however, seems to us to detract from the book's
overall virtues. It would seem that many who may find the book
useful for beginning students in describing the ‘what’ of educational
research would argue at length about the authors’ general or spe-
cific opinions as to proper goals of educational research. The book
is reasonably up-to-date, although the references do not, unfor- 
tunately, include some of the better recent articles and mono- 
graphs on curriculum evaluation or design and statistics. The 
organization will not appeal to some, but the index is comprehen- 
sive so this problem is not serious. However abbreviated, a great 
deal of territory is covered in this small book. 


LORRAINE K. Nova
ROBERT M. PRUZEK
State University of New York at Albany 


Philip E. Vernon. The Structure of Human Abilities. (2nd ed.) New 
York: Barnes and Noble (and Methuen, London). 1971. Pp. 
208. $5.75 and $3.00 (paperback). Reprinted with little change 
from the 1950 edition. 


The author assumes a reader with preliminary knowledge of 
what an intelligence test is and the meaning of a correlation coeffi- 
cient. Based on these assumptions he carefully and sparely devel- 
ops factor analysis as a procedure. The historical application of 
factor analysis as a method to develop structured human abilities is 
next considered. Conflicts between the British School (importance 
of the general factor g) and the American School (independent 
types of abilities) are detailed. In particular the development of 
the hierarchical theory is admirably presented. 

Educational attainment is shown to be influenced by g (defined 
as “. . . the eduction of relations and correlates”), v:ed (the verbal-
numerical-educational major group factor), and the X factor (de-
fined as “. . . a complex of personality traits, interests and back-
ground”). The chapters detailing intellectual faculties as consist-
ing of g plus v (one of the minor group factors under v:ed, con-
sidered to be verbal ability) and intelligence test factors are of
particular interest when analyzing the results of compensatory
education in the United States. Discussion of the unintentional
factors (practice, difficulty, speed, etc.) which are inherent in
the method of testing is introduced. The remaining chapters con-
sider aesthetic, psychomotor and physical, mechanical, and occu-
pational abilities. The assumed relationships of these abilities to
the general intellectual abilities hierarchy are most interesting.

Two appendices are included in this text. The first, General 
and Group Factor vs. Multiple Factor Theories, is an exceptional 
summary of the arguments. The second, Factor Analysis from 
1950 to 1959, is somewhat disappointing, particularly in contrast
to the rest of the text. This disappointment stems more from omis-
sion than commission (e.g., Kaiser’s Varimax Procedure is con-
sidered only with a three-line footnote).

Minor typesetting errors, as expected in a first publication (e.g., 
on p. 72, Table VII, v:ed is identified as n:ed), are present. The
most serious criticism with this publication is the binding; after 
several readings the text was literally in pieces. This minor criti- 
cism should not detract from the overall quality of the work. Any 
individual interested in factor analysis and its use who has not 
read the text should avail themselves of the paperback edition. 


ROBERT A. SMITH
University of Southern California 


B. J. Winer. Statistical Principles in Experimental Design. (2nd.
ed.). New York: McGraw-Hill, 1971, pp. xx + 907. $16.50. 


This reviewer has always been intrigued by revisions of popu- 
lar textbooks. It is always interesting to match “wits” with the 
textbook writer to see if the changes in content are those you are 
desirous of seeing. For this reason, this reviewer was looking for- 
ward to inspecting the second edition of Winer's Statistical Prin- 
ciples in Experimental Design. The first edition has been widely 
acclaimed and accepted by most as the classic text on experimen- 
tal design for students in the behavioral sciences. Indeed, the 
highest compliment is paid to the text when students with experi- 
mental design questions are invariably told to “look it up in 
Winer and then come see me.” 

A comparison of the second edition to the original text suggests 
that Professor B. J. Winer has changed his perception of what
should be included in a “comprehensive text in experimental de-
sign.” While the majority of the content in the first edition was
retained (the chapter on incomplete block designs has been dropped),
the new edition almost looks like a new text, mainly because of
the inclusion of a long and comprehensive section on Linear Mod-
els as Chapter 2. Additional changes include the rearrangement of
topic sequence, and the addition of topics such as the multivariate
analysis of variance, nonorthogonal (i.e., unbalanced) factorial de-
signs, and the use of normal equations to define experimental de-
signs. The appendixes now include a new section on random varia-
bles, additional material on non-parametric analogues to the analysis
of variance procedures, and additional tables. It is also evident
that Professor Winer responded positively to reviewers’ comments
on the first edition by rewriting and/or correcting sections that
were pertinent.

As noted before, the inclusion of a section on linear models
makes the text appear new, and it is also one of the most outstand-
ing features. In somewhat less than 100 pages, Winer develops
the topic of linear models from the basic no-distributional as-
sumptions of simple linear regression to hypothesis testing using
the general linear hypothesis model in the full multivariate regres-
sion situation. Also represented are topics rarely seen in the behavioral
science literature: generalized inverses, the inverse
sweep (SWP) computing procedure for estimating effects, and the
rationale and procedures for obtaining uncorrelated dependent
variables in univariate and multivariate regression situations.

While some minor typographical errors were noted in the new 
content, no major content related errors were detected. However, 
this reviewer wondered why the topics of the multivariate analy- 
sis of variance and power were emphasized in the text if there was 
no discussion of the relationship of statistical significance of the 
overall multivariate test and subset dependent variables under 
the null hypothesis or the very important relationship between 
power and the number of dependent variables in a multivariate 
test. One of the difficulties in working through Chapter 2 is the 
use of procedures that have not yet been introduced in the text 
to obtain solutions in examples. For example in one of the earliest 
sections in Chapter 2, the SWP operation is used to solve for the 
least-squares weights b0 and b1, but is not discussed until some 50
pages later. Also, the principle underlying hypothesis testing is 
used in the second chapter but not discussed until the latter part of 
the third chapter. When this fact is noted along with Winer's 
comment that “... the reader may omit many or indeed all of the
sections in this chapter without any serious loss in readability of
what follows,” this reviewer wonders why this material was pre-
sented in the body of the text and not as an appendix or, indeed,
why it was included in the book at all. The answer to this question
would likely be that the linear model provides the basic frame- 
work of which the analysis of variance is a special case and thus 
the theory on the linear model should be presented in any com- 
prehensive text on experimental design. 

It seems relevant to ask where this text is likely to
fit in terms of usage. As noted before, the first edition had become
the ultimate reference, “the bible,” of analysis of variance proce-
dures for experimental data collected by behavioral scientists.
However, it has proven to be less than satisfactory as a text for a
second course in data analysis. It appeared to this author that
the overwhelming amount of content given for a topic and the
lack of a common thread to tie together the major topics turned
off many students. The imposition of the second chapter with an
increased mathematical difficulty level, even though not required,
would appear to make the text even more forbidding to students
who must master the content in order to meet curricular require- 
ments. 
Another concern relates to the lack of consideration of com-
putational efficiency of the procedures presented in the textbook,
with the exceptions previously noted in the section on linear
models. In this day when just about everyone has access to a com-
puter and where many schools of education at major universities
have their own computers, it seems inconceivable to fail to note 
the generalized computing algorithms available which efficiently use 
the computer’s capabilities to conduct an analysis of variance. In- 
deed, the proliferation of published computer program abstracts 
in journals such as Educational and Psychological Measurement 
for the one way ANOVA, factorial design, repeated measures, 
ANOCOV with one and several covariates, tends to suggest they
are unique procedures when the theory underlying the linear mod- 
els shows that each can be run within the framework of the same 
generalized regression analysis program. 
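
The claim that these procedures fall within one regression framework can be illustrated with a small sketch. It is not taken from Winer's text or from any published program. It assumes a one-way design, dummy (0/1) coding of group membership with the first group as reference, and invented data; all function names are hypothetical. The F ratio computed from the usual between and within sums of squares is compared with the F ratio from a least-squares regression on the dummy variables.

```python
def one_way_anova_f(groups):
    """Classical one-way ANOVA F ratio from between/within sums of squares."""
    values = [y for g in groups for y in g]
    grand_mean = sum(values) / len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((y - sum(g) / len(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def regression_f(groups):
    """Overall F ratio from regressing the scores on an intercept plus
    0/1 dummy variables for every group except the first."""
    k = len(groups)
    rows, y = [], []
    for j, g in enumerate(groups):
        for value in g:
            rows.append([1.0] + [1.0 if j == d else 0.0 for d in range(1, k)])
            y.append(value)
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    ss_residual = sum((v - f) ** 2 for v, f in zip(y, fitted))
    return (ss_regression / (k - 1)) / (ss_residual / (len(y) - k))


# Invented scores for three treatment groups.
groups = [[4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 15.0]]
print(one_way_anova_f(groups), regression_f(groups))
```

The two F ratios agree because the fitted values of the dummy-variable regression are simply the group means, so the regression and residual sums of squares coincide with the between and within sums of squares.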

It seems fair to say that this text will also achieve the status of 
a classic in the field of experimental design in rather short order. 
Further, while the grand result of this revision is a substantially 
larger text (907 pages for the second edition versus 672 pages 
for the first edition), it must surely provide the most comprehen- 
sive selection of topics related to the statistical aspects of experi- 
mental design of any textbook in print. This text belongs on the 
shelf of any behavioral and social scientist who is interested in
research. 


JOHN L. WASIK
North Carolina State University 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


Editor: W. Scott Gehman
Managing Editor: Geraldine R. Thomas


BOARD OF COOPERATING EDITORS 


Dorothy C. Adkins, University of Hawaii
Lewis R. Aiken, Guilford College
Harold P. Bechtoldt, The University of Iowa
William V. Clemans, Science Research Associates, Inc.
Louis D. Cohen, University of Florida
Junius A. Davis, Educational Testing Service
Harold A. Edgerton, Performance Research, Inc.
Max D. Engelhart, Duke University
Gene V Glass, University of Colorado
J. P. Guilford, University of Southern California, Los Angeles
John A. Hornaday, Babson College
John E. Horrocks, The Ohio State University
Cyril J. Hoyt, University of Minnesota
Milton D. Jacobson, University of Virginia
Joseph C. Johnson II, The University of Connecticut
William G. Katzenmeyer, Duke University
Robert E. Lana, Temple University
E. F. Lindquist, State University of Iowa
Frederic M. Lord, Educational Testing Service
Ardie Lubin, Navy Medical Neuropsychiatric Research Unit, San Diego
Louis L. McQuitty, University of Miami, Coral Gables
William B. Michael, University of Southern California, Los Angeles
Howard G. Miller, North Carolina State University at Raleigh
Henry Moughamian, City Colleges of Chicago
Ellis B. Page, The University of Connecticut
врчвову S. Rasu, Science Research Associates, Inc.
Ben H. Romine, Jr., University of North Carolina at Charlotte
Brendon Smith, The University of North Carolina at Greensboro
Thelma G. Thurstone, University of North Carolina at Chapel Hill
Herbert A. Toops, The Ohio State University
Willard G. Warrington, Michigan State University
John L. Wasik, North Carolina State University at Raleigh
Kinnard White, University of North Carolina at Chapel Hill
John E. Williams, Wake Forest University
E. G. Williamson, University of Minnesota


VOLUME THIRTY-TWO, NUMBER THREE, AUTUMN 1972 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1972, 32, 543-557. 


RATING SCALES AS MEASURES OF CLINICAL JUDG- 
MENT III: JUDGMENTS OF THE SELF ON PERSONALITY 
INVENTORY SCALES AND DIRECT RATINGS 


JAMES BENTLEY TAYLOR, MARIANNE PTACEK, 
MARTHA CARITHERS, CAROL GRIFFIN AND 
LOLAFAYE COYNE 


The Menninger Foundation 


The first paper in this series provided a background and 
rationale for a new scale construction method: example-anchored 
scaling (Taylor, 1968a). Our initial application was with “clinical 
judgment,” in which one person rated another on a variable 
of clinical interest. The method was used to construct a particular 
kind of rating scale: a thermometer-like line divided into 
100 points, with discrete way points being anchored by specific 
examples. Most of our reported applications have employed case- 
history vignettes as anchors, but photographs have also been 
used, and other kinds of examples (drawings, test responses, 
attitude statements, etc.) are possible. 

Briefly stated, construction of an example-anchored scale begins 
by defining, as concisely and clearly as possible, the domain 
of behavior to be measured. A set of 30 or so examples (items) 
are accumulated so as to sample the full range of the domain. 
Five or six judges rank order the examples in terms of the de- 
fined content. The degree of order—i.e., the agreement in rank- 
ings—is evaluated statistically by Kendall’s W (1962). It has 
been shown (Taylor, 1968b) that this brief judgmental method 
can give results highly comparable (r = .94, .95) to the results 
obtained by successive interval scale methods. The items are 
held to exhibit a satisfactory degree of unidimensionality (i.e., 
approach total order) if W is significant, if the average interjudge 
agreement is > .75, and if the predicted stability of mean ranks 
is > .95. If the items meet these criteria, the scale vector is as- 
sumed to be unidimensional and scale construction proceeds. A 
smaller set of items is chosen to provide anchoring examples along 
the full range of the scale—generally the most agreed-upon items 
are used for this purpose. 
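
The agreement statistics named above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, rankings are assumed to contain no ties, and the conversion from W to the average interjudge rank correlation uses the standard identity for untied ranks. The significance test for W and the "predicted stability of mean ranks" are not reproduced, since the text does not give their formulas.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance W for untied rankings.

    rankings: list of m lists, one per judge, each giving the ranks
    (1..n) that judge assigned to the same n items.
    """
    m = len(rankings)
    n = len(rankings[0])
    # Sum of the ranks each item received across judges.
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def mean_interjudge_rho(w, m):
    """Average pairwise rank correlation among m judges implied by W."""
    return (m * w - 1) / (m - 1)


# Hypothetical example: three judges ranking four items.
judges = [[1, 2, 3, 4], [1, 3, 2, 4], [2, 1, 3, 4]]
w = kendalls_w(judges)
print(round(w, 3), round(mean_interjudge_rho(w, len(judges)), 3))
```

Under the criteria quoted in the text, a set of items would proceed to scale construction only if the average interjudge agreement computed this way exceeded .75, alongside the other two conditions.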

In previous reports we have shown that the use of such scales 
can lead to a marked increase in interjudge reliability, whether 
the judges are clinicians evaluating personality from a single 
interview, or untrained community workers reporting on a series 
of contacts (Taylor, 1968a; Taylor et al., 1970; Taylor, 1971). 
Whereas numerical rating scales show a typical interjudge re- 
liability in the r = .40-.60 range, example-anchored scales typically 
show reliabilities in the .70-.99 range—and this with untrained 
raters. The method thus appears to offer advantages when one in- 
dividual is asked to make complex “clinical” judgments of another. 

Example-anchored scaling is not restricted to such applications. 
The initial paper suggested a second area of potential applica- 
tion: personality measurement based on self-report. 

It is conceivable that a respondent may be able to place 
himself upon a specified personality dimension using an 
example-anchored scale. Obviously there are limitations to 
this approach: clinicians have dimensions in their thinking 
unknown to their patients, and subtle items in personality 
scales provide unique cues. On the other hand, some per- 
sonality dimensions should be amenable to direct self-rating, 
and therefore to example-anchored scaling. Although example- 
anchored scales lack subtlety, it is noteworthy that such 
tests as the MMPI discriminate best with their most obvious 
items (Duff, 1965). The use of example-anchored scales in 
personality measurement thus seems worth exploring (Taylor, 
1968a, p. 764). 

This paper reports such an exploration. We here compare a set 
of multi-item personality scales with a set of comparable example- 
anchored scales. Our findings lead us to re-examine the perpetual 
problem—“What do personality scales measure?”—and to ques- 
tion some traditional assumptions. 


Analyzing Two Measurement Strategies 


Following a multitrait-multimethod strategy, a variety of traits 
are here measured by comparable multi-item scales, and by 
comparable example-anchored rating scales. The intercorrelations 
of these various measures provide the basic data for a comparison 
of convergent and discriminant validity across methods, and across 
content domains. 

The distinction between convergent and discriminant validity 
follows the discussion by Campbell and Fiske (1959). Convergent 
validity refers to the correlation between different tests which 
ostensibly sample the identical content domain. Do two tests 
which supposedly measure “the same thing” correlate highly? If 
so, they exhibit convergent validity. A group of tests may show 
good convergent validity, but be useless because they also cor- 
relate highly with other tests outside their domain of content— 
i.e., they may lack discriminant validity. “Measures of the same 
trait should correlate higher with each other than they do with 
measures of different traits involving separate methods. Ideally, 
these validity values should also be higher than the correlations 
among different traits measured by the same method.” (Campbell 
and Fiske, 1959, p. 104.) 


Variables and Their Measurement² 


Wiggens (1966) has used the MMPI item pool to develop 13 
scales which tap various content domains. The scales are “inter- 
nally consistent, moderately independent, and representative of 
the major clusters that appear to exist in the total MMPI item 
pool.” These 13 scales, plus the MMPI “L” scale, define the key 
content areas investigated in this study. Table 1 lists these content 
scales. 



² A full description of all scales used, the set of multitrait-multimethod 
matrices described herein, and the rotated matrix of factor loadings found in 
this study have been deposited with the National Auxiliary Publications 
Service. Order Document #01836 from the National Auxiliary Publications 
Service, c/o CCM Information Sciences, Inc., 909 3rd Avenue, New York, 
N. Y. 10022, remitting in advance $2.00 for microfiche or $5.00 for photocopies. 
Make checks payable to: CCMIC-NAPS. 
TABLE 1 
Comparable Self-Report Measures 

Example-anchored Rating Scales (scale numbers, scale question, polar anchors) | MMPI Scale (Wiggens, 1966) | Other Scales 
------------------------------------------------------------------------------------------------------------------- 
1, 16 | How self-confident do you usually feel? | Very self-confident, vs. Not very self-confident | 31 MOR (Poor Morale) | 45 Self-Confidence (Edwards, 1967) 
2, 17 | How good is your health? | Good health, vs. Poor health | 32 HEA (Poor Health) | 46 Health (Bell, 1963) 
3, 18 | How much satisfaction and contentment do you feel with your family? | Dissatisfaction with family, friction and disharmony, vs. Contentment and satisfaction with family | 33 FAM (Family Problems) | 47 Home Adjustment (Bell, 1963) 
4, 19 | How conscientious do you think you are? | Extremely moral, conscientious and virtuous, vs. Not highly moral, or virtuous | 34 L | 48 Marlowe-Crowne Social Desirability (Crowne and Marlowe, 1964) 
5, 20 | How willing are you to express anger? | Strong tendency to express anger, vs. Unwillingness to express anger | 35 HOS (Manifest Hostility) | 49 Friendliness (Guilford and Zimmerman, 1949) 
6, 21 | How restless or excited do you become? | Restless, enthusiastic, easily excited, vs. Placid and calm | 36 HYP (Hypomania) | 50 Impulsivity (Jackson, 1967) 
7, 22 | How much is your life disrupted by fears? | Strong tendency to fearfulness, fears interfere much with life, vs. Relative lack of fear, fears don't interfere with life | 37 PHO (Phobias) | 51 Harm avoidance (Jackson, 1967) 
8, 23 | How religious are you, in a fundamentalist sense? | Strongly religious, vs. Not very religious, in a fundamentalist sense | 38 REL (Religious fundamentalism) | 52 Religious Orientation (Heist and Yonge, 1968) 
9, 24 | In general, how cheerful do you usually feel? | Cheerful, vs. Depressed | 39 DEP (Depression) | 53 Emotional Stability (Guilford and Zimmerman, 1949) 
10, 25 | How comfortable and confident do you feel with other people? | Comfort, confidence, enjoyment around people, vs. Discomfort, self-consciousness, and worry around people | 40 SOC (Social maladjustment) | 54 Sociability (Guilford and Zimmerman, 1949) 
11, 26 | How would you describe your kinds of interests in terms of toughness or sensitivity? | Tough, vs. sensitive | 41 FEM (Feminine interests) | 55 Masculinity (Guilford and Zimmerman, 1955) 
12, 27 | How smoothly do your thoughts run? | Thoughts run smoothly, vs. Thoughts disrupted | 42 ORG (Organic Symptoms) | (none) 
13, 28 | How anxious do you feel? | Very anxious, vs. Very calm | (none) | 56 Anxiety level (Heist and Yonge, 1968) 
14, 29 | How much do you experience mental problems? | Troubled by mental problems, vs. Not troubled by mental problems | 43 PSY (Psychoticism) | (none) 
15, 30 | How much do you feel you are guided by the usual standards of right and wrong? | Not guided by usual standards of right and wrong, vs. Guided by usual standards of right and wrong | 44 AUT (Authority Conflict) | (none) 
(none) | | | | 57 Autonomy (Jackson, 1967) 
(none) | | | | 58 Defendence (Jackson, 1967) 
(none) | | | | 59 Psychiatric screening (Langner, 1962) 


A second set of multi-item scales was chosen from various 
sources to measure “the same” content areas. The difficulties in 
selecting such scales are well known. Scales which supposedly 
measure the same thing may vary markedly in item content, while 
scales with different names may tap the same content domain. For 
example, Butt and Fiske (1968) have shown wide content vari- 
ability in a group of scales designed to measure dominance; Taylor 
and Reitz (1968) have shown the same variability in measures 
of “self-esteem.” For the current study, a number of personality 
inventories were screened to find scales which, by name or de- 
scription, seemed comparable to the key MMPI content mea- 
sures. When a promising scale was identified, its items were read 
and compared to the items of the key scale. The comparisons 
were made jointly by the principal investigator and by a research 
assistant; the scales judged to have the most similar items were 
designated as “comparable.” This procedure, while admittedly 
ad hoc, accords with Jackson’s recent admonition: “One need 
not abandon as unscientific his unique capacity to judge and 
weigh content” (1971, p. 247). These scales and their sources 
are also listed in Table 1. No comparable multi-item measures 
were found which adequately matched three of the Wiggens content 
scales: Org (Organic Symptoms), Psy (Psychoticism), and Aut 
(Authority Conflict). These Wiggens scales were, however, re- 
tained in the analysis. An additional noncomparable multi-item 
scale, “Anxiety level,” was added to the group as a highly gen- 
eral measure of psychological distress. Finally, three other scales 
of rather mixed content were included, as shown in Table 1. 

These 15 content areas were also each measured by two “com- 
parable” example-anchored rating scales, using personality scale 
items as examples and developed according to the procedures 
described previously (Taylor, 1968a). With one exception, all 
example-anchored scales met the criteria advanced previously; 
the exception showing a mean interjudge agreement of .74 rather 
than .75. Figure 1 shows a single such example-anchored scale, 
while Table 1 lists the defining scale question and polar anchors. 

The study, then, used 59 different instruments to measure 15 
different content areas. Eleven of the areas were measured four 
times, twice by comparable multi-item scales, and twice by com- 
parable example-anchored scales. Four other content areas were 
measured by a single multi-item scale and two example-anchored 
scales, while three “mixed” multi-item scales were included which 
seemed to overlap with several other measures. Intercorrelating 
these 59 scales produced two multitrait-multimethod matrices, 
with replication within designated measurement methods and con- 
tents. 


Participants and Testing 


The various scales were given, as part of the regular intake 
evaluation procedures, to 125 consecutive patients newly admitted 
to the inpatient or outpatient facilities of Topeka State Hospital. 
Fifty-seven (45 per cent) of the sample were male, and 86 (60 
per cent) were between the ages of 14 and 29. The majority (58 
per cent) were inpatients, and the overwhelming majority (94 
per cent) were Caucasian. All were resident in the Topeka met- 
ropolitan area. 

Testing was done over a two-day period, with one to three days 
intervening between testing sessions. The first day the patients 
received one set of example-anchored scales and the MMPI, 
along with a battery of other group administered tests. At the 
second session they received a booklet containing the “comparable” 
multi-item scales and the second set of example-anchored scales. 
The two sets of example-anchored scales were presented in 
counterbalanced order. 


[Figure 1. How Self-Confident Do You Usually Feel? A vertical scale 
marked in tens, running from “Very self-confident” at the top to 
“Not very confident” at 0, with way points anchored by example 
statements.] 

Anchor at 30: When I meet a stranger I often think that he is better than I 
am. I feel like giving up quickly when things go wrong. 

Other anchoring statements, in the order printed: 

When I make a decision I usually stick to it. People seem 
naturally to turn to me when decisions have to be made. I 
can make up my mind in making various decisions without too 
much trouble. 

Other people usually like my ideas. I would say that I was 
popular with people. I have very few fears compared to my 
friends. 

I'm not doing as well in my job (or school, etc.) as I'd like 
to. I'm often sorry later for something I've done. I have 
vivid dreams disturbing my sleep. 

I often feel upset at my job (or school, or family). I often 
get discouraged about what I'm doing. Usually no one pays 
much attention to my thoughts and ideas. 

I cannot do anything well. I have a rather low opinion of 
myself. The future seems hopeless to me. 


Analysis 


The 59 different measures were intercorrelated, and the cor- 
relations arranged into multitrait-multimethod matrices to facili- 
tate inspection of discriminant and convergent validity. Although 
general methods for comparing convergent and discriminant vali- 
dities have been proposed (Kavanagh et al., 1971; Boruch et al., 
1970; Stanley, 1961), they provide only a rough approximation 
of the Campbell-Fiske criteria, and were not used here. The cor- 
relation matrices were further analyzed by the principal axis 
method, using split-half reliability estimates as diagonal elements 
for the multi-item scales and the average scale correlations (as 
estimated from the coefficient of concordance) for the example- 
anchored scales. Since the L scale overlapped with the MMPI 
content scales, it was not included in the factor analysis. All factors 
with eigenvalues exceeding unity were rotated to the Normal Vari- 
max criterion. 
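
The factoring steps just described can be sketched as follows. This is a minimal illustration, not the program the authors used: the function names are hypothetical, the principal-axis step is done in a single pass (reliability estimates placed on the diagonal, no iteration of communalities), and the rotation is the common SVD-based varimax algorithm with Kaiser row normalization, which is assumed here to correspond to the "Normal Varimax" criterion.

```python
import numpy as np


def principal_axis(R, diag_estimates):
    """Factor a correlation matrix whose diagonal has been replaced by
    reliability estimates, keeping factors with eigenvalues above 1."""
    reduced = np.array(R, dtype=float)
    np.fill_diagonal(reduced, diag_estimates)
    eigvals, eigvecs = np.linalg.eigh(reduced)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])


def normal_varimax(loadings, max_iter=100, tol=1e-8):
    """Varimax rotation with Kaiser normalization.
    Returns the rotated loadings and the orthogonal rotation matrix."""
    h = np.sqrt((loadings ** 2).sum(axis=1))   # square roots of communalities
    h[h == 0] = 1.0
    A = loadings / h[:, None]                  # rows scaled to unit length
    p, k = A.shape
    T = np.eye(k)
    last = 0.0
    for _ in range(max_iter):
        B = A @ T
        grad = A.T @ (B ** 3 - B * (B ** 2).sum(axis=0) / p)
        U, s, Vt = np.linalg.svd(grad)
        T = U @ Vt
        if s.sum() - last < tol:
            break
        last = s.sum()
    return (A @ T) * h[:, None], T


# Hypothetical data: seven scales built from two underlying factors.
true_loadings = np.array([[.8, .1], [.7, .2], [.9, .0], [.75, .1],
                          [.1, .8], [.2, .7], [.0, .9]])
R = true_loadings @ true_loadings.T
np.fill_diagonal(R, 1.0)
reliabilities = (true_loadings ** 2).sum(axis=1)  # stand-in diagonal estimates
unrotated = principal_axis(R, reliabilities)
rotated, T = normal_varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each scale's communality (the row sum of squared loadings) is the same before and after rotation; only the distribution of variance across factors changes.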


Results 


The Multitrait-Multimethod Comparison 


Analysis of a multitrait-multimethod matrix involves com- 
parative examination of diagonal and off-diagonal correlations. 
The diagonal elements provide information about convergent 
validity—how much do “comparable” scales really measure the 
same thing? The off-diagonal entries provide information about 
discriminant validity—how much does the scale correlate with 
measures outside its claimed domain? Obviously the higher the 
diagonal values the better, since these establish the convergent 
validities. Obviously, the lower the off-diagonal values the bet- 
ter, since low values establish discriminant validity. If a set 
of scales has both convergent and discriminant validity, every 
diagonal value would be greater than every off-diagonal value 
in its row and column. (Instances in which an off-diagonal value 
is greater than its relevant diagonal value will be called “ano- 
malies” henceforth.) 
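
The anomaly count can be illustrated with a short sketch. The counting convention is an assumption: each diagonal value is compared with every other entry in its own row and its own column, and each comparison that the off-diagonal entry wins is counted once. The function name and the small matrix are hypothetical.

```python
def count_anomalies(block):
    """Count off-diagonal entries that exceed the diagonal value of
    their row or of their column in one square validity block.

    block[i][j] is the correlation between trait i measured by one
    method and trait j measured by the other; block[i][i] is the
    convergent validity for trait i.
    """
    n = len(block)
    anomalies = 0
    for i in range(n):
        diagonal = block[i][i]
        for j in range(n):
            if j == i:
                continue
            if block[i][j] > diagonal:   # entry in the same row
                anomalies += 1
            if block[j][i] > diagonal:   # entry in the same column
                anomalies += 1
    return anomalies


# Hypothetical three-trait block with one anomaly (.55 exceeds .50).
example = [[.60, .20, .10],
           [.30, .50, .55],
           [.10, .20, .70]]
print(count_anomalies(example))
```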

Tables 2 and 3 summarize the comparisons among diagonal 
and off-diagonal correlations. Table 2 shows the mean diagonal 
coefficients within and between measurement methods, while 
Table 3 shows the mean off-diagonal coefficients within and be- 
tween measurement methods. Thus Table 2 provides data for 
assessing convergent validity; Table 3 data for assessing dis- 
criminant validity. 

Inspection of Table 2 suggests that the example-anchored 
scales showed reasonable convergent validity, both with compar- 
able example-anchored scales and with comparable multi-item 
scales. Convergent validity appears to be higher for the comparable 
multi-item scales than for the example-anchored scales. Some 
traits also appear to have higher convergent validities than 
others. Possible trait and method differences were tested for sig- 
nificance (in the first 11 content areas) by a two-factor analysis of 
variance. Both the Trait factor and the Method-Pair factor were 
significant, p < .001 and p < .025 respectively. Following the 
significant Method-Pair main effect, individual differences among 
the Method-Pairs columns (a)-(f) were tested by the Newman- 
Keuls test. The only significant differences found were between 
column (b) (p < .05) and both columns (d) and (f). The com- 
parison of columns (a) and (b) was not significant. However, 
nine of the 11 convergent validities in column (b) were larger 
than the comparable ones in column (a); by the sign test this 
was significant at p < .07 (two-tailed test). In a word, con- 
vergent validity between the two sets of multi-item scales was 
significantly higher than between the example-anchored scales 
and one set of the multi-item scales. The difference in convergent 
validity between the two sets of multi-item scales and the two 
sets of example-anchored scales approached significance with one 
method of analysis. Thus the comparison of convergent validity 
favors the multi-item scales although the difference does not 
reach the usual standards of significance. 
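
The sign-test figure quoted above can be checked directly from the binomial distribution. The sketch below assumes the usual two-tailed sign test with no tied pairs; the function name is hypothetical.

```python
from math import comb


def sign_test_two_tailed(successes, n):
    """Two-tailed sign test: probability of a split at least this
    uneven when each direction is equally likely."""
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Nine of the 11 coefficients in column (b) exceed those in column (a).
p = sign_test_two_tailed(9, 11)
print(round(p, 4))
```

The upper tail is (55 + 11 + 1)/2048, so the two-tailed probability is 134/2048, or about .065, which agrees with the reported p < .07.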

The significant Trait factor found in this analysis indicates 
that different content areas vary markedly and consistently in 
the degree to which they are convergently measured by “com- 
parable” instruments. These differences are independent of the 
measurement technique, as indicated by the nonsignificant Trait 
× Method-Pair interaction. Thus the rank order correlation 
between columns (a) and (b) in Table 2 is .70, significant 
at p < .05. Similarly, the rho between columns (c) and (f) 
is .81, and between columns (d) and (e) is .77, both significant 
at p < .01. Convergent validity is here more influenced by differences 
among traits than by differences in measurement techniques. 
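
The rank order correlation between columns (a) and (b) can be recomputed from the coefficients for the first 11 traits in the table of diagonal coefficients (Table 2). The values below are transcribed from that table, reading the dash before .195 as a minus sign (an assumption; the rank order is the same either way). The helper names are hypothetical, and the simple formula used applies because there are no tied values.

```python
def ranks(values):
    """Ranks 1..n in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for untied data."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Columns (a) and (b), traits 1 through 11.
col_a = [.623, .546, .464, .451, .333, .345, .540, .722, .649, .690, .325]
col_b = [.653, .739, .784, .566, .564, -.195, .398, .828, .818, .887, .686]
print(round(spearman_rho(col_a, col_b), 2))
```

The squared rank differences sum to 66, giving rho = 1 - 396/1320 = .70, the value reported in the text.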

Table 3 shows the mean off-diagonal coefficients within and 
between measurement methods. Here, the higher the average co- 
efficient, the poorer the discriminant validity. 
TABLE 2 
Diagonal Coefficients (Convergent Validities) by Trait and Method 

Method pairs: 
(a) Example-Anchored 1, Example-Anchored 2 
(b) Multi-Item 1 (MMPI), Multi-Item 2 (Other) 
(c) Example-Anchored 1, Multi-Item 1 
(d) Example-Anchored 1, Multi-Item 2 
(e) Example-Anchored 2, Multi-Item 1 
(f) Example-Anchored 2, Multi-Item 2 

Trait                      (a)     (b)     (c)     (d)     (e)     (f) 
 1. Self-Confident        .623    .653    .677    .635    .542    [?] 
 2. Health                .546    .739    .484    .500    .603    [?] 
 3. Family                .464    .784    .526    .475    .546    [?] 
 4. Conscientious (L)     .451    .566    .306    .359    .217    [?] 
 5. Anger                 .333    .564    .418    .414    [?]     [?] 
 6. Restless, hypo        .345   -.195    .227   -.138    .259    [?] 
 7. Fears, phobias        .540    .398    .441    .317    [?]     [?] 
 8. Religiosity           .722    .828    .759    .713    .774    [?] 
 9. Depression            .649    .818    .647    .629    .642    [?] 
10. Social Comfort        .690    .887    .752    .737    .720    [?] 
11. Masc. Interests       .325    .686    .520    .542    .423    .307 
    Mean                  .517    .612    .523    .471    .514    .457 
12. Thought               .556     —      .535     —      .487     — 
13. Anxiety               .396     —       —      .547     —      [?] 
14. Mental Problem        .421     —      .456     —      .504     — 
15. Right-wrong           .444     —      .396     —      .374     — 

[?] marks an entry that is illegible in the source. 




We may also compare the number of “anomalies” found within 
the off-diagonal coefficients: the more frequent the anomalies, 
the poorer the discriminant validity. The matter is made complex 
in this instance, however, by the finding that the two multi-item 
measures of “excitability” were completely noncomparable, so that 
this one scale produces an undue number of anomalies. In what 
follows, therefore, this one content dimension is not included. 

Inspection suggests that the comparable example-anchored scales 
had a higher discriminant validity (the mean off-diagonal coefficients 
ranging from .134 to .151, with a total of three anomalies); and 
that the comparable multi-item scales had relatively poor dis- 
criminant validity (the mean off-diagonal coefficients ranging from 
.209 to .283, with a total of 18 anomalies). The difference in number 
of anomalies is significant, p < .01, by the chi-square test. Thus the 
comparison of discriminant validity significantly favors the ex- 
ample-anchored scales. 
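
The text does not say how the chi-square test was set up. One plausible reading, offered only as an assumption, is a one-degree-of-freedom goodness-of-fit test of the two anomaly counts (3 and 18) against an even split, which presumes the two method pairs offered equal numbers of comparisons. The function name is hypothetical.

```python
from math import erfc, sqrt


def chi_square_equal_split(count_a, count_b):
    """Goodness-of-fit chi-square (1 df) for two counts against the
    hypothesis that they are equally likely."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # Upper-tail probability of a chi-square variate with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = chi_square_equal_split(3, 18)
print(round(chi2, 2), p_value < .01)
```

Under this assumption the statistic is about 10.7 with 1 df, which is consistent with the reported significance level.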
TABLE 3 
Mean Off-Diagonal Coefficients, by Method 

                        Example-Anchored Scales     Multi-Item Scales 
                        Set 1       Set 2           Set 1 (MMPI)    Set 2 (Other) 
Example-Anchored 1      .137         —               —               — 
Example-Anchored 2      .140        .151             —               — 
Multi-Item 1            .155        .153            .283             — 
Multi-Item 2            .150        .165            .226            .209 

The Factor Analyses 


In view of these findings, it seemed useful to examine the under- 
lying factor structure of the matrix. Using the procedure described 
earlier, nine orthogonal factors were extracted. The marker scales 
and suggested designation for each factor are shown in Table 4. 

Notable here is the fact that eight of the nine factors are 
clearly marked by clusters of comparable content scales, and 
that no clear methods factor emerged for the example-anchored 
scales. Several of the eight content factors were marked equally 
well by the example-anchored and the multi-item scales. 

The ninth factor is clearly unrelated to content per se. It has 
16 loadings > .35, all occurring with the multi-item scales. It 
thus seems reasonable to interpret it as a methods factor. This 
interpretation is supported by the fact that the Marlowe-Crowne 
Social desirability scale has its highest loading on this factor. 
When the multi-item scale loadings are correlated with the per- 
centage of items keyed “true” in each scale, the correlation is 
.622 (p < .001). These results suggest that the ninth factor com- 
bines acquiescence response set and one form of social desirability 
response set into a single “methods” dimension. 

The clarity, simplicity, and interpretability of this factor solu- 
tion stand in marked contrast to many other analyses reported 
in the literature. Partly this results from the repeated measure- 
ment of comparable variables, which tends to produce a more 
articulated factor structure. Partly it results from having the con- 
tent of the scales clearly defined, so that the factors also become 
more clearly definable. Perhaps for these reasons the present 
analysis identified nine interpretable content factors, while 
Wiggens (1966), using the 13 MMPI content scales, identified 
only three. 


TABLE 4 
Factors and Descriptions 

Factor  Major Marker Items                                          Range of Loading  Designation 
I       All comparable self-confidence scales (1, 16, 31, 45)       (.66-.71)         Confident cheer vs. 
        All comparable cheerful-depressed scales (9, 24, 39, 53)    (.69-.74)         Depressive anxiety 
        All comparable anxiety scales (13, 28, 56)                  (.49-.68)         and self-doubt 
        Item anchored thought disruption scales (12, 27)            (.70-.73) 
II      All comparable religion scales (8, 23, 38, 52)              (.82-.90)         Religious Fundamentalism 
III     3 of 4 comparable anger control scales (5, 20, 49)          (.48-.55)         Conscientious Self-Control 
        Item anchored conscientious scales (4, 19)                  (.53-.66) 
        Item anchored standards scales (15, 30)                     (.47-.53) 
IV      All comparable masculine-feminine scales (11, 26, 41, 55)   (.54-.76)         Masculine Interests 
V       All comparable family scales (3, 18, 33, 47)                (.62-.76)         Family Harmony 
VI      All comparable social comfort scales (10, 25, 40, 54)       (.62-.70)         Social Comfort 
VII     All comparable physical health scales (2, 17, 46, 32)       (.66-.76)         Physical Health 
        MMPI Organic symptoms scale (42)                            (.64) 
VIII    Item anchored mental problems scales (14, 29)               (.39-.55)         Disruptive Ideation 
        Three of four fears scales (7, 22, 37)                      (.45-.58) 
IX      14 Multi-item scales                                        (.40-.74)         Methods Response Set 

These factor results clarify the higher convergent validity, 
and lower discriminant validity of the multi-item scales. The 
majority of multi-item scales had loadings on a general methods 
factor. Such shared factor variance led to higher correlations 
between multi-item scales, and hence to higher convergent valid- 
ities and lower discriminant validities. In short, the higher con- 
vergent validity between multi-item scales does not mean they 
are more effective measures; it only means that they share a 
common kind of “noise,” a common kind of content-irrelevant 
variance. 


Summary 


We have shown that example-anchored scales provide a fea- 
sible alternative to multi-item scales in assessing personality by 
self-report. The example-anchored scales are capable of measur- 
ing the same content domains with the same degree of factorial 
clarity. Unlike the multi-item scales, example-anchored scales 
appear to be remarkably free of methods-linked bias, and so 
provide a “purer” kind of measurement. Their brevity offers a 
further advantage: the single set of example-anchored scales 
averaged 15 minutes per administration; the comparable MMPI 
scales averaged 103 minutes. The sole seeming advantage of the 
multi-item scales—their higher convergent validity—turned out 
on closer examination to be an artifact, arising from shared method 
or response set variance. 

The findings are surprising in view of current beliefs, holding 
that single items or single responses are inherently unreliable, and 
that scale reliability can only be achieved by adding together a 
large number of item responses. Perhaps the same beliefs have 
discouraged empirical study of the efficiency of single, simple 
self-ratings, even though occasional reports in the literature sug- 
gest the usefulness of self-rating methods (e.g. Hase and Gold- 
berg, 1967; Peterson, 1965; Wilson, 1967; Campbell and Fiske, 
1959; Taylor and Parker, 1964; Carroll, 1952; Riker, 1944). Not 
commonly realized is the fact that two routes exist to reducing the 
effects of unique and error variance in a measure: one can 
“cancel out” the effects of such nonsystematic variance by a 
process of signal averaging, or one can devise a single measuring 
instrument which is itself relatively free of such extraneous 
“noise.” The usual multi-item scale represents an application of 
the signal-averaging strategy; we have here demonstrated that the 
second strategy is also feasible, and may offer certain advantages 
to personality measurement. 
A NEW APPROACH TO RESPONSE SETS IN ANALYSIS 
OF A TEST OF MOTIVATION TO ACHIEVE 


DOROTHY C. ADKINS 
University of Hawaii 
BONNIE L. BALLIF 

Fordham University 


GUMPGOOKIES is an objective-projective test of motivation to 
achieve in school, intended for children in an age range of three 
and a half to eight or possibly nine. Each item consists of a 
description of two imaginary figures called gumpgookies, and the 
task of the child is to indicate with which gumpgookie he identi- 
fies. 

For example: 

These gumpgookies should be working. 
This one is watching. 
This one is working. 

The first form of the test consisted of 100 items in which 
illustrations of gumpgookies were presented in left-hand or right- 
hand positions and in which the left-hand figure was always 
described first. 

Factor analyses of data from this first form yielded factors 
which, although suggestive of substantive interpretations, seemed 
to be influenced by the positions of the answers and/or primacy 
versus recency, i.e., whether the keyed answer was presented 
first or last. Thus some factors were loaded for items with 
answers predominantly in the right-hand position, which were pre- 
sented last, and some for items with answers predominantly in the 
left-hand position, which were presented first. 

In an effort to dissipate the effects of these response sets of the 
subjects, a new format was devised, whereby the alternatives 
were presented in varying positions—up and down, left and right, 
upper left and lower right, and upper right and lower left. At 
the same time, the order in which the alternatives were presented 
by the examiner was randomized. The number of items was 
reduced from 100 to 75, and most of the alternatives were 
revised to reduce their cognitive or verbal difficulty. This test 
was prepared in three versions, an individual form, a group form 
for nonreaders, and a group form for readers. 

A previous report to the Office of Economic Opportunity, availa- 
ble through ERIC, described a number of factor analyses of data 
on these forms of the Gumpgookies test that had been completed 
by November, 1969 (Adkins and Ballif, 1970a). Although the 
results were interpreted in terms of substantive factors, the in- 
terpretations were still clouded by the troublesome influence of 
two main types of extraneous influences or response sets: the 
effects of the positions of the answers to the items and the in- 
fluence of the order in which the keyed answer is presented. A 
later publication presents the hypotheses on which the test is 
based and discusses their relation to empirically determined factors 
for the test in randomized format (Adkins and Ballif, 1970b). 

In retrospect, it appears that the effort to get rid of the effects of 
response sets by means of revising the format was not success- 
ful. Extraneous influences were still in evidence and had only be- 
come somewhat more difficult to detect. Parenthetically, it should 
be noted that these response sets have no systematic undesirable 
influence on total score on the test, because the subject is expected 
to get only a chance score on items to which he responds on the 
basis of a particular set. But the response sets do affect the 
items that are loaded on particular factors, so that a subject 
could get unwarrantedly high or low scores on the separate 
factors. Moreover, the effects of response sets on the composition 
of the factors made their interpretations very tenuous. 

Since, disappointingly, the change in format had not been 
successful, another type of solution to the response set problem
had to be found before factors could be interpreted with any
assurance. The next approach that was pursued was based on 
the idea of computing response set scores for each subject, par- 
tialling these out of the item intercorrelation matrix, and fac- 
toring the resulting matrix. Hence for each subject were com- 
puted the number of answers he chose that were in the left-hand 
position, the number of answers he chose that were in the up 
position, and the number of answers he chose that had been 
presented first. In the case of the items in which alternatives 
had been placed in a diagonal position, e.g., upper left and lower 
right, an arbitrary decision was made to regard upper left and 
upper right as up, lower left and lower right as down. This was 
done because the small numbers of items involved in the two
diagonal positions would have resulted in response set scores
of very low reliability for these positions.

The problem of developing the mathematical solution for partialling
these three variables out of an item intercorrelation matrix
was presented to Dr. Paul Horst, whose technical report will 
appear later. The computer program was worked out by Renato 
Espinosa and Robert Bloedon, members of the Hawaii Education 
Research and Development Center staff, with the guidance of Dr. 
Horst. The result is a program that yields orthogonal factors that 
are completely uncorrelated with the response set scores. 
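
The mathematical solution itself is deferred to Dr. Horst's technical report and is not reproduced here. As a minimal sketch only, the standard formula for a higher-order partial correlation matrix can be written as follows. It assumes NumPy and synthetic continuous data in place of the dichotomous item responses; the function and variable names are hypothetical, and this is not claimed to be the program described above.

```python
import numpy as np

def partial_correlation_matrix(r_items, r_item_ctrl, r_ctrl):
    """Item intercorrelations with the control variables partialled out.

    r_items     : (k, k) correlations among the k items
    r_item_ctrl : (k, m) correlations of the items with the m control scores
    r_ctrl      : (m, m) correlations among the control scores
    """
    # Residual covariance of the (standardized) items given the controls.
    resid = r_items - r_item_ctrl @ np.linalg.solve(r_ctrl, r_item_ctrl.T)
    # Rescale the residual covariances to correlations.
    sd = np.sqrt(np.diag(resid))
    return resid / np.outer(sd, sd)

def simulate(n=500, k=6, m=3, seed=0):
    """Synthetic items contaminated by m hypothetical response set scores."""
    rng = np.random.default_rng(seed)
    sets = rng.normal(size=(n, m))
    items = rng.normal(size=(n, k)) + sets @ rng.normal(scale=0.5, size=(m, k))
    return items, sets

items, sets = simulate()
k = items.shape[1]
r = np.corrcoef(np.hstack([items, sets]), rowvar=False)
partial = partial_correlation_matrix(r[:k, :k], r[:k, k:], r[k:, k:])
```

Factoring `partial` rather than the zero-order matrix is what would make the resulting factors uncorrelated with the three response set scores in a setup of this kind.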

The complete program includes routines that provide, among 
other things, the correlations of each item with the response set 
scores, the rotated "partial" factor loadings for each item, and
reliability estimates (KR-20) for each partial factor as well
as for the response set scores. It prints, for each item for each
partial factor, approximate integral weights of −1, 0, or 1 that
could be used in hand scoring to yield approximate factor scores.
A weight of −1 for an item indicates that it is functioning as a
suppressor variable. The program also yields exact factor scores
for each subject, based upon regression weights for each item.
The approximate factor scores correlate in the neighborhood of
.90 with the exact scores. It should be understood that, for any
group of subjects sufficiently large to warrant a factor analysis,
exact factor scores would be used if separate factor scores were
desired. However, if a factor analysis resulting in approximate
integral weights is available for a large sample and an investigator
has only a small sample of the same kind of subject, the solution
of approximate factor scores by means of the integral weights 
might be serviceable. These could be obtained by hand scoring. 
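
Hand scoring with integral weights can be illustrated with a short sketch. The text does not say how the program derives the weights of −1, 0, and 1, so the rule used here (keep the sign of a regression weight whose absolute value reaches a cutoff) is purely an assumption, as are the function names and the cutoff value.

```python
import numpy as np

def integral_weights(reg_weights, cutoff):
    """Reduce regression weights to -1, 0, or 1 (assumed rule: sign if |w| >= cutoff)."""
    w = np.asarray(reg_weights, dtype=float)
    return np.where(np.abs(w) >= cutoff, np.sign(w), 0.0).astype(int)

def factor_scores(responses, weights):
    """Weighted sums of item responses; rows are subjects, columns are factors."""
    return np.asarray(responses) @ np.asarray(weights)

# Hypothetical regression weights for three items on two factors.
exact_w = [[0.30, -0.02],
           [-0.25, 0.40],
           [0.05, 0.10]]
approx_w = integral_weights(exact_w, cutoff=0.10)
hand_scores = factor_scores([[1, 0, 1]], approx_w)
```

With real data, correlating `factor_scores(responses, exact_w)` with `factor_scores(responses, approx_w)` would give the kind of agreement figure quoted above.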

For the first factor analyses that had been run on the earliest 
form of the test, the number of factors was not specified, and it 
was naively supposed that the computations should be con- 
tinued until the eigenvalue dropped below unity. It was found, 
however, that this had not occurred even when some forty factors 
had been extracted. It was clear that each of such a large number 
of factors would be determined primarily by so few items as to 
make their interpretation impossible. Moreover, because of the 
theoretical basis of the test, it was thought that probably no more 
than five or at most six factors would be interpretable. Hence 
the later analyses are largely confined to five factors, although 
some solutions with three, four, six, and eight factors were obtained. 

Separate analyses were made for 1813 four-year-old children, 
for 128 first-graders, for 122 second-graders, for 250 first- and 
second-graders combined, and finally for a total group of 2313 
children. Not surprisingly, the KR-20 values for the partial factors
tended to be higher for the older children. It should also
be mentioned that for all groups the KR-20 values for the 
partial factors tended to be less than for the factors based on the 
zero-order correlation matrix. This is doubtless true because the 
latter factors include reliable effects of response sets. Response 
set scores were more consistent for the older children. It was 
also interesting to find that factors for the older children showed 
relatively more influence of a primacy-recency set, those for the 
younger children more influence of answer position sets, as indicated 
by correlations of response set scores both with individual items 
and with “unpartial” factor scores. 
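
For readers who wish to reproduce reliability figures of this kind, the KR-20 coefficient can be sketched as below. This is the textbook Kuder-Richardson formula 20 for dichotomous items, not the program described above; the function name and the tiny response matrices are hypothetical.

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson formula 20 for a subjects-by-items matrix of 0/1 scores."""
    x = np.asarray(responses, dtype=float)
    k = x.shape[1]                      # number of items
    p = x.mean(axis=0)                  # proportion scoring 1 on each item
    sum_pq = (p * (1.0 - p)).sum()      # sum of item variances
    total_var = x.sum(axis=1).var()     # population variance of total scores
    return (k / (k - 1)) * (1.0 - sum_pq / total_var)

# Three perfectly consistent items, then two unrelated items.
consistent = kr20([[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]])
unrelated = kr20([[1, 0], [0, 1], [1, 1], [0, 0]])
```

The population variance is used for the total score so that it is on the same footing as the item variances p(1 − p).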

Details of the extensive work that was done in comparing 
the several solutions for different numbers of factors and for 
different groups, as well as in comparing partial factors and 
unpartial factors, will not be presented here. They would only 
overwhelm the reader, as indeed they often threatened to do with 
the investigators. It had soon become apparent, with respect to 
both the original unpartial factors and the partial factors, that 
those for the four-year-olds did not correspond to those for the 
first- and second-graders as closely as had been expected. It

TABLE 3

Correlations among Loadings of Five Factors Based on the Third-Order Partial
Correlation Matrix and on the Zero-Order Correlation Matrix for 250 Hawaiian
Children in Grades 1 and 2

          Third-Order Partial                Zero-Order
       1     2     3     4     5     6     7     8     9    10
 1  1.00  -.05  -.19  -.02  -.02   .88   .08   .06  -.14  -.38
 2  -.05  1.00  -.20  -.17  -.11  -.12   .95  -.10  -.17  -.17
 3  -.19  -.20  1.00  -.08  -.07  -.40  -.23   .44   .50   .11
 4  -.02  -.17  -.08  1.00  -.08   .20  -.30   .00  -.07   .72
 5  -.02  -.11  -.07  -.08  1.00   .16  -.20   .53   .16  -.40
 6   .88  -.12  -.40   .20   .16  1.00  -.10  -.12  -.08  -.22
 7   .03   .95  -.23  -.30  -.20  -.10  1.00  -.11  -.22  -.18
 8   .06  -.10   .44   .00   .53  -.12  -.11  1.00  -.13  -.10
 9  -.14  -.17   .50  -.07   .16  -.08  -.22  -.13  1.00  -.16
10  -.36  -.17   .11   .72  -.40  -.22  -.18  -.10  -.16  1.00

was not unreasonable to suppose, however, that the factorial
composition of motivation to achieve in school changes with age.
Indeed, such is almost certainly the case. Yet, despite the conviction
that changes with age in the factors affecting the test
responses were to be expected, attempts to interpret the changes
were not highly successful.

Full exploration of this problem led to a question as to the
dependability of factor loadings obtained from phi coefficients
based upon relatively small numbers of cases. Although the
general plan of the investigations that have been done was to
have at least 200 cases for any factor analysis, it seemed possible
that this number was too small. Hence a plan was devised
whereby routinely each sample was divided at random into halves
and separate factor analyses were made for each half as well
as for the total sample. Then the general plan was to investigate
the similarity of the three sets of factor loadings for each sample
by inspecting the correlations of the loadings from the three
solutions, i.e., for the two half samples and for the total sample.
In this approach, a factor for the total sample was regarded as
verified when a factor in one half sample and a factor in the other
half sample each shows its highest correlation with the same
factor in the total sample at the same time that these same
factors for the half samples have the highest correlation of any
pair of factors across the half samples. Thus factor 2 for the
first half sample might correlate .85 with factor 3 for the total;

TABLE 6

Correlations among Factor Loadings of Four Factors Based on the Third-Order
Partial Correlation Matrix and on the Zero-Order Correlation Matrix for
250 Hawaiian Children in Grades 1 and 2

       Third-Order Partial          Zero-Order
       1     2     3     4     5     6     7     8
 1  1.00  -.08  -.24  -.13   .90   .05  -.10  -.04
 2  -.08  1.00  -.26  -.20  -.03   .91  -.33  -.20
 3  -.24  -.26  1.00  -.21  -.46  -.09   .55   .53
 4  -.13  -.20  -.21  1.00   .20  -.47   .30  -.10
 5   .90  -.03  -.46   .20  1.00  -.08  -.25  -.09
 6   .05   .91  -.09  -.47  -.08  1.00  -.17  -.19
 7  -.10  -.33   .55   .30  -.25  -.17  1.00  -.20
 8  -.04  -.20   .53  -.10  -.09  -.19  -.20  1.00


factor 3 for the second half sample might correlate .77 with 
factor 3 for the total; and factor 2 for the first half sample might 
correlate .73 with factor 3 for the second half sample. If these 
were the highest of the correlations inspected for these factors, 
the factor for the total sample would be regarded as verified. 
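
The verification rule just described can be sketched in code. The version below is one reading of the rule, under stated assumptions: "highest correlation of any pair of factors across the half samples" is taken to mean that the two half-sample factors are each other's best match, and signed rather than absolute correlations are compared. The function name and the illustrative matrices are hypothetical; only the values .85, .77, and .73 come from the example in the text.

```python
import numpy as np

def verified_factors(c1t, c2t, c12):
    """Return {total_factor: (half1_factor, half2_factor)} for verified factors.

    c1t : correlations of first-half loadings (rows) with total loadings (columns)
    c2t : correlations of second-half loadings (rows) with total loadings (columns)
    c12 : correlations of first-half loadings (rows) with second-half loadings (columns)
    All indices are zero-based.
    """
    c1t, c2t, c12 = (np.asarray(m, dtype=float) for m in (c1t, c2t, c12))
    verified = {}
    for i in range(c1t.shape[0]):
        t = int(np.argmax(c1t[i]))            # total factor best matched by half-1 factor i
        for j in range(c2t.shape[0]):
            if int(np.argmax(c2t[j])) != t:   # half-2 factor j must point to the same total factor
                continue
            # Assumed reading: i and j must also be each other's best match.
            if int(np.argmax(c12[i])) == j and int(np.argmax(c12[:, j])) == i:
                verified[t] = (i, j)
    return verified

# Illustrative matrices; factor 2 of half 1 and factor 3 of half 2 (indices 1 and 2)
# both match factor 3 of the total sample (index 2), as in the example above.
c1t = [[.10, .20, .05], [.10, .05, .85], [.30, .10, .20]]
c2t = [[.40, .10, .10], [.10, .50, .20], [.05, .10, .77]]
c12 = [[.10, .60, .10], [.10, .20, .73], [.10, .30, .10]]
result = verified_factors(c1t, c2t, c12)
```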

Results of a number of applications of this approach are pre- 
sented in Tables 1, 2, 4, 5, 7, 8, 10, 11 and 14. The other tables 
show further comparisons among different factor analyses. 

Detailed inspection of these tables has led to the conclusion 
that the most defensible interpretation of factors results from 
the five-factor analysis based upon a group of 2313 cases, in- 
cluding 2063 four-year-olds and 250 first- and second-graders. 
The five factors for this total group were verified more clearly 
than for any other subsample. Hence the interpretation of factors 
that can be offered now will be based upon this analysis for the 
total sample. At this stage, however, it will have become apparent 
that interpretations of factors gleaned from responses of young 
children to dichotomous items of the type in question are tenuous 
at best and must be based upon very large numbers of children. 

Although the KR-20 estimates of reliability for the total 
test score on Gumpgookies have been in the neighborhood of .85
to .90, the estimated reliabilities, as determined by KR-20 
coefficients, for the five factors of course are not so high, ranging 
from .35 to .55 for the large combined sample. This is not surpris- 
ing, since the total test consists of only 75 items. An indicated 
next step, if any particular factor is to be explored more fully,

TABLE 9

Correlations among Loadings on Five Factors Based on the Third-Order Partial
Correlation Matrix and on the Zero-Order Correlation Matrix for 1813 Head
Start Children

          Third-Order Partial                Zero-Order
       1     2     3     4     5     6     7     8     9    10
 1  1.00  -.17  -.20  -.13  -.16   .27  -.41   .00   .72   .07
 2  -.17  1.00  -.15  -.26  -.22   .21  -.12  -.25  -.35   .65
 3  -.20  -.15  1.00  -.21  -.18   .10   .54  -.05  -.40   .03
 4  -.13  -.26  -.21  1.00  -.18  -.05  -.01   .27   .21  -.46
 5  -.16  -.22  -.18  -.18  1.00  -.05   .23   .16  -.08  -.34
 6   .27   .21   .10  -.05  -.05  1.00  -.37  -.21  -.16  -.31
 7  -.41  -.12   .54  -.01   .23  -.37  1.00  -.29  -.11   .04
 8   .00  -.25  -.05   .27   .16  -.21  -.29  1.00  -.20  -.06
 9   .72  -.35  -.40   .21  -.08  -.16  -.11  -.20  1.00  -.08
10   .07   .65   .03  -.46  -.34  -.31   .04  -.06  -.03  1.00

will be to increase the number of items contributing strongly to 
that factor and have a single test for it. With an increase in
number of items per factor, the factor score reliabilities may be
expected to increase.
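
The expected gain from lengthening a factor scale can be illustrated with the Spearman-Brown prophecy formula. The text does not name this formula, so its use here is an assumption, and it holds only to the extent that the added items are comparable to the existing ones. The function name and the example values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`.

    Assumes the added items are parallel to the existing ones.
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)

# A factor scale with reliability .45 (within the range reported above),
# tripled in length, would be projected to reach about .71.
projected = spearman_brown(0.45, 3)
```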

For the interpretation of factors, once they have been verified
by the method described above, the method has been first to
list for a factor the items that have their highest loading on it
for the total sample. Then the loading of each such item for the
corresponding factor in each half sample is recorded, with a
notation as to whether it is the highest loading for the item.
Greatest weight is accorded those items for which there is verification
in all three analyses, i.e., for which the highest loadings
apply to the appropriate verified factors. Attention is also given
to the size of loadings, those of about .30 or above tending to be
associated with greater verifiability than those below .30.

In the discussion that follows, the factor numbers in parentheses 
following the letter designations refer to the numbers for the 
total group analysis and the two half-group analyses in Table 14. 

Factor A (11, 1, 6) consists of items indicating an autonomous
activity orientation permeating the use of time and interaction
with others. This on-the-go behavior is more than generalized
activity; it is initiating and engaging in specific behavior that is
always appropriate to insure success in the particular tasks and
situations at hand. It involves both knowing the effective instrumental
steps and taking them. These activities are instrumental
to achievement in general, e.g., wanting to work longer;
to achievement in school, e.g., keeps trying to write numbers;
as well as to obtaining reinforcement for achievement, e.g.,
shows its paintings to others. Perhaps this interpretation can
also include ways of thinking: attitudes about school as instrumental
covert behavior for success in school. If so, the few
items suggesting that school and learning are liked are still consistent
within this framework. In any case, the factor consists
of thinking of and doing those appropriate activities that are
instrumental to achievement. It might appropriately be named
instrumental activity.

TABLE 12

Correlations among Factor Loadings of Four Factors Based on the Third-Order
Partial Correlation Matrix and on the Zero-Order Correlation Matrix for
1813 Head Start Children

       Third-Order Partial          Zero-Order
       1     2     3     4     5     6     7     8
 1  1.00  -.20  -.28  -.18   .18  -.21   .17   .67
 2  -.20  1.00  -.21  -.30   .18   .45  -.22  -.36
 3  -.28  -.21  1.00  -.28   .08   .35  -.10  -.28
 4  -.18  -.30  -.28  1.00  -.05  -.37   .28   .22
 5   .18   .18   .08  -.05  1.00  -.47  -.21  -.11
 6  -.21   .45   .35  -.37  -.47  1.00  -.14  -.08
 7   .17  -.22  -.10   .28  -.21  -.14  1.00  -.17
 8   .67  -.36  -.28   .22  -.11  -.08  -.17  1.00

The reflection of a preference for school- and teacher-related 
experiences is clear in factor B (12, 2, 9). The specific items in- 
clude wanting to go to school to learn and liking learning along 
with watching and helping the teacher as opposed to playing 
or engaging in other activities. This positive attitude toward 
school is further exhibited by an identification with the teacher, 
e.g., wanting to be the teacher when playing school. Factor B,
then, appears to be a school enjoyment factor. Because items 
dealing with work-like activities in non-school settings were 
only sparsely represented in the total test, however, the pos- 
sibility that the factor would be better described as work en- 
joyment has not been ruled out. 

The items constituting factor C (13, 5, 8) represent the ability 
to evaluate one’s own performance coupled with the confidence 
that the evaluation will be high. The process of self-evaluation
is suggested by items portraying gumpgookies who know when
their work is right, when they are doing well in school, what 
they can and cannot do, and whether or not they are always 
doing their best. Items describing gumpgookies who are self- 
evaluated as always at their best and doing well also suggest an 
awareness of their own excellence. Perhaps this factor can be 
considered an evaluative factor. 

Factor D (14, 4, 10) consists almost entirely of items set in 
competitive physical situations, e.g., winning in running, climbing 
higher, and leading in follow the leader. Apparently it repre- 
sents self-confidence in coming out on top, in being the best or 
better than the next one. With additional items staged in other 
settings, it seems likely that the factor would transcend physical 
activities. 

The common denominator for items loading on factor E (15, 
3, 7) has to do with an awareness of implications of present 
behavior for the future—perhaps even more specifically for 
accomplishment of a future goal. The gumpgookies in these 
items are still trying to attain their future goals, e.g., trying
to write. They are apparently directed by their own self-initiated
purposes. This, then, appears to be some type of purposive
factor. The need for further verification of this interpretation
with additional items or through experimental work is evident.


SUPPRESSOR VARIABLES, PREDICTION, AND THE
INTERPRETATION OF PSYCHOLOGICAL RELATIONSHIPS¹


ANTHONY J. CONGER² AND DOUGLAS N. JACKSON
University of Western Ontario 


One of the traditional problems confronting the applied measurement
specialist in psychology is the large-scale prediction
of particular criteria. Because it is often difficult to find more 
than a small number of predictors contributing to incremental 
validity, the idea of a suppressor variable (Horst, 1941)—one 
contributing to incremental validity while itself uncorrelated 
with the criterion—has continued to capture periodically the 
imagination of those confronted with prediction problems. The 
fact that bona fide suppressors have only rarely been reported 
(Lord and Novick, 1968) has not diminished the search. In a 
similar manner, ever since the technique of partial correlation 
was developed, research workers have sought to interpret psy- 
chological relationships more accurately and more meaningfully, 
by seeking to eliminate statistically the effects of unwanted
variance in evaluating the correlation between two psychological 
variables. 

The present paper seeks to cast additional light on these two 
distinguishable but related approaches. In particular, three major 
aims are proposed: (a) to explicate, definitionally and mathematically,
the nature of the suppressor variable, and to show the
conditions under which there are formal identities between results
based upon analyses using a suppressor variable, and those
based upon approaches involving multiple, and part and partial 
correlational procedures; (b) to demonstrate certain mathematical 
constraints upon the degree of gain in predictability of a criterion 
by the addition of a suppressor, and to compare this potential 
gain with that attainable by seeking new predictors; and (c)
to distinguish the distinct aims of prediction and of construct 
measurement as they each are relevant to the statistical control 
of unwanted variance, and to make appropriate recommendations. 

A number of authors (Horst, 1941; Meehl, 1945; McNemar, 
1945; Rozeboom, 1966; Lord and Novick, 1968) have defined 
the mode of operation, as well as a mathematical basis, for the 
suppressor variable and some agreement has been reached about 
difficulties with the suppressor variable. Most articles dealing 
with suppressors point out that suppressors “do not abound in 
practice.” It is also well known that suppressor weights are 
somewhat more fickle than regular regression weights upon cross- 
validation, a point proven by Lubin (1957). Most treatments 
of suppressors also stress that multiple regression provides the 
best explanation of suppressors, although no alternatives are 
usually investigated. Darlington (1968), for example, suggests 
that there is a common, but erroneous, belief that a modification 
of standard multiple regression is required for an understanding 
of the mode of operation of suppressors. 

Some disagreement exists as to what does constitute a sup- 
pressor variable and there has been little research in the way of 
relating suppressors to other well-known correlational phenomena 
like part and partial correlation or moderator variables. Some 
questions have also been raised as to the situations in which the 
suppressor is, and is not, appropriate. Some of the confusion 
surrounding the suppressor is based upon a loosening and reinterpretation
of the original Horst definition, but the major difficulty,
in our opinion, is due to the failure to distinguish between
the use of suppressor variables in three different contexts: (a) in
the prediction of a particular criterion; (b) for the measurement
of a latent trait or construct; and (c) in the delineation of
relationships between constructs. For these distinct uses, different 
logical and psychological foundations are required. 


The Definition of a Suppressor Variable


Before taking up these issues in detail, it would be appropriate 
to review the classical definition of a suppressor variable, and 
to consider more recent discussions of this formulation. While 
portions of this introductory section, particularly the illustrative 
material, may appear elementary to some readers, it is con- 
sidered worthwhile to eliminate simple definitional ambiguity 
prior to proceeding with the main arguments of this paper. 

A suppressor variable is here defined, in a manner consistent 
with the classical definition, as one wholly uncorrelated with a 
criterion, but which, by virtue of a correlation with a predictor, 
improves the prediction of the criterion. 

There is a paradoxical quality (McNemar, 1945) associated 
with a suppressor, in that it is possible to increase prediction by 
utilizing a variable which shows a negligible correlation with the 
criterion, provided it correlates well with a variable which does 
correlate with the criterion. This apparent paradoxical quality 
becomes intelligible when one considers an illustrative example, 
one provided by Horst (1966, p. 355) growing out of his experi- 
ence in the prediction of success in pilot training in World War II. 
Included in a prediction battery were tests of mechanical ability, 
numerical ability, spatial ability, and verbal ability. Each of the 
first three had substantial positive correlations with the criterion 
of success. Verbal ability, however, had a near-zero correlation 
with the criterion but fairly high correlations with the scores for 
the other three tests. The multiple correlation with the criterion 
proved to be higher when verbal ability was included than
when it was not included in the predictor battery, in spite of
its negligible zero-order correlation. Horst interpreted this finding
psychologically by pointing out that high verbal aptitude
was not important in the kind of flight training conducted in
World War II. However, verbal ability was to some degree important
in obtaining relatively higher scores on mechanical,
numerical, and spatial aptitude tests; for example, reading comprehension
facilitated understanding of test instructions, leading
in turn to a higher score. When variance associated with
criterion-irrelevant verbal aptitude was subtracted from the weighted
score based on the other three predictors, their efficiency as
predictors improved. Persons who obtained a particular weighted
composite score on these tests primarily because of high verbal
aptitude would thus not tend to be selected over those who 
obtained a similar score based primarily on the abilities required 
to learn to fly an aircraft. Thus, prediction was improved be- 
cause variance associated with verbal ability was suppressed. 

The operation of a suppressor variable may be illustrated 
further by considering a Venn diagram (Figure 1) and by utiliz- 
ing the common elements formulation of correlation (cf. McNemar, 
1945, 1962). The criterion (c) is comprised of 16 elements, of 
which seven are in common (depicted by c-p) with the predictor 
(p). The predictor also is comprised of 16 elements, with nine
irrelevant to the criterion. For this relationship the common
element correlation yields $r_{cp} = .44$. If eight of the nine irrelevant
elements are accounted for (depicted by p-s) by the suppressor
(s), which shares no elements with the criterion, the zero-order
correlations $r_{ps}$ and $r_{cs}$ are .75 and .00 respectively. Inspection of
Figure 1 shows that although the suppressor is wholly unrelated
to the criterion, it is useful in identifying those elements in the 
predictor common to the criterion. This influence is apparent 
in the multiple regression equation based on the zero-order correlations:
$Z_c = .661 Z_p - .496 Z_s$. The regression weights indicate
that the weighted suppressor variable should be subtracted from
the predictor in order to remove the criterion-irrelevant variance.

Figure 1. Venn diagram of common elements explanation of suppressor
variables, showing the predictor (p), criterion (c), and suppressor (s).

Ordinarily, the regression weight of a suppressor variable in 
a multiple regression equation will be negative in sign, but a 
potential source of confusion arises when all negatively-weighted 
variables are considered as suppressors. For example, Lubin (1957) 
and Darlington (1968) loosen the traditional definition to include 
such “negative suppressors,” viz., a variable with a positive 
relation to a criterion, but a negative one with some other pre- 
dictor. Darlington defines a suppressor as a variable which, when 
included with a positive predictor, receives a negative weight 
when a regression weight is derived on the “population.” Darling- 
ton’s definition thus includes Lubin’s negative suppressor as well 
as Horst’s traditional suppressor. It is possible that the prac- 
tice of referring to all variables with negative regression weights 
as suppressor variables derives from experience in predicting per- 
formance criteria from aptitude and achievement test batteries. 
Aptitude tests showing a significant but negative correlation with
a performance criterion (indicating that persons with lower aptitude
scores are showing superior criterion performance) are perhaps
just as paradoxical as traditional suppressor variables. But para- 
doxical or not, this situation is logically distinct from the suppressor 
variable as here defined. This apparent paradox is eliminated if one 
considers predictors to be like bipolar attitudinal or personality 
scales or cognitive style variables, whose direction of keying and 
direction of positive evaluation may be arbitrary. What is a predictor
(e.g., flexibility) for one investigator might become a suppressor
(e.g., rigidity) for a second investigator. Obviously there
are problems with these more general “suppressors” in that exclud- 
ing some variables and including others could change a predictor 
to a suppressor and vice versa, or as Lord and Novick (1968) point 
out, simple reflection of the variables converts the predictor to a 
suppressor and the suppressor to a predictor. Obviously, the more 
general definitions of suppressor variables need closer scrutiny. 
In order to avoid terminological and conceptual confusion, we 
recommend that the term, suppressor, be limited in the psychological
literature to the classical definition, and that it be expunged
from other contexts. Following this recommendation, this paper is
limited to a discussion of the traditional suppressor as originally
defined by Horst (1941). 


The Relation of the Suppressor to Part and Partial Correlation 


Gardner Murphy (1932) once stated that the partial correlation, 
together with the discovery of the conditioned response, ranked 
as one of the two most important discoveries of twentieth century 
psychology. As early as four decades ago Murphy had insight 
into the tremendous power inherent in the possibility of holding 
one variable constant statistically while observing the effects 
of additional variables, a power that has more recently become 
manifest with developments in factor analysis. While some 
might dispute Murphy’s assertion about their importance, the 
part and partial correlation have enjoyed a place in the psycho- 
logical statistics texts of the present century. But in spite of 
their appearing together in measurement texts, the suppressor 
variable has not been sufficiently related to partial (and part) 
correlation. Perhaps this is because they stem from different 
traditions—the suppressor from large scale prediction of criterion 
performances, and the part and partial correlation from attempts 
to interpret the psychological bases for relationships by the 
statistical control of theoretically distinct variance. While there 
have been isolated hints of a relation between these two formulations
(Jackson and Pacine, 1961; McNemar, 1962, p. 406, Problem
10.27; Rozeboom, 1966), a review of the literature has
revealed to us no explicit mathematical treatment of their relationship.

According to McNemar (1962), if the influence of one vari- 
able is removed from another and correlated with a third vari- 
able, a situation exists in which the part correlation should be 
used; however, this is precisely the justification for suppressor 
relationships. The part correlation, $r_{c(p.s)}$, is given by

$$ r_{c(p.s)} = \frac{r_{cp} - r_{cs} r_{ps}}{\sqrt{1 - r_{ps}^2}} . $$

In particular, if $r_{cs}$ is equal to zero, an ideal suppressor situation,
then the part correlation formula yields

$$ r_{c(p.s)} = \frac{r_{cp}}{\sqrt{1 - r_{ps}^2}} . $$

This correlation is clearly higher than $r_{cp}$ in all circumstances
where $r_{ps} \neq 0$. This formula is identical to that derived by
Meehl (1945) for a suppressor variable with $r_{cs} = 0$. Thus, the
part correlation and the suppressor variable are mathematically 
identical. 

Removing the influence of a third variable from two others 
(the influence of a suppressor from both the predictor and criterion) 
is a situation in which a partial correlation should be used, i.e.,

$$ r_{cp.s} = \frac{r_{cp} - r_{cs} r_{ps}}{\sqrt{1 - r_{cs}^2}\,\sqrt{1 - r_{ps}^2}} . $$

This correlation would be even greater than the part correlation.
Under classical suppressor conditions with $r_{cs} = 0$ this yields

$$ r_{cp.s} = \frac{r_{cp}}{\sqrt{1 - r_{ps}^2}} . $$

This is the same as the part correlation and, again, is a result 
identical to Meehl's formula for the suppressor. This should not 
be surprising because the suppressor does not really influence the 
criterion, only the predictor; there is thus no influence to remove 
from the criterion. The part correlation theoretically explains 
the suppressor relationship and offers no paradox. It also yields 
the same degree of relationship (under “ideal” conditions) as 
multiple regression. 
The equation for a two-variate multiple correlation is

$$ R_{c.ps} = \sqrt{\frac{r_{cp}^2 + r_{cs}^2 - 2 r_{cp} r_{cs} r_{ps}}{1 - r_{ps}^2}} . $$

Under ideal suppressor conditions ($r_{cs} = 0$) this reduces to

$$ R_{c.ps} = \frac{r_{cp}}{\sqrt{1 - r_{ps}^2}} . $$

This is a formulation we have already seen in the part and
partial correlation situation, and in the suppressor. That is, there
is no difference either in meaning or in formula between any of
these formulations under ideal suppressor conditions. The similarity
between the part and partial correlation is a result of the
"suppressor" having no influence on the criterion and might be
considered to border on the trivial in comparison with the other
identities. The equality of the suppressor and multiple regression
formulations is well known and needs little elaboration. The 
most striking result is the equivalence of the part correlation 
and the multiple correlation. This equivalence provides a mathe- 
matical basis for considering a part correlation approach to sup- 
pressor variables as an alternative to the multiple regression 
approach. Conceptually, the part correlation approach seems more 
appropriate and less paradoxical.
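
The equivalence of the part, partial, and multiple correlations under ideal suppressor conditions can be checked numerically. The sketch below implements the three textbook formulas discussed in this section and evaluates them at the zero-order correlations of the Venn diagram illustration (.44, .00, and .75); the function names are hypothetical.

```python
import math

def part_corr(r_cp, r_cs, r_ps):
    """Part correlation of c with p, with s removed from p only."""
    return (r_cp - r_cs * r_ps) / math.sqrt(1.0 - r_ps ** 2)

def partial_corr(r_cp, r_cs, r_ps):
    """Partial correlation of c with p, with s removed from both."""
    return (r_cp - r_cs * r_ps) / (
        math.sqrt(1.0 - r_cs ** 2) * math.sqrt(1.0 - r_ps ** 2)
    )

def multiple_corr(r_cp, r_cs, r_ps):
    """Multiple correlation of c with the two predictors p and s."""
    r_squared = (r_cp ** 2 + r_cs ** 2 - 2.0 * r_cp * r_cs * r_ps) / (1.0 - r_ps ** 2)
    return math.sqrt(r_squared)

# Ideal suppressor conditions: the suppressor is uncorrelated with the criterion.
r_cp, r_cs, r_ps = 0.44, 0.0, 0.75
values = (
    part_corr(r_cp, r_cs, r_ps),
    partial_corr(r_cp, r_cs, r_ps),
    multiple_corr(r_cp, r_cs, r_ps),
)
```

All three entries of `values` coincide and exceed the zero-order validity of .44, which is the identity described in the text.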


Which Should be Used, the Part, Partial, or Multiple Correlation? 
The Classical Prediction Problem 


Suppose ideal suppressor conditions do not exist. Should the
part, partial, or multiple correlation be used? The answer depends
upon one’s purpose. Consider the case where maximum
validity is sought. Consideration of the squared part correlation
versus the squared multiple correlation yields a definite relationship
between the two. The squared part correlation is found
from

$$ r_{c(p.s)}^2 = \frac{r_{cp}^2 - 2 r_{cp} r_{cs} r_{ps} + r_{cs}^2 r_{ps}^2}{1 - r_{ps}^2} , $$

and the squared multiple correlation is given by

$$ R_{c.ps}^2 = \frac{r_{cp}^2 - 2 r_{cp} r_{cs} r_{ps} + r_{cs}^2}{1 - r_{ps}^2} . $$

The difference between them is

$$ R_{c.ps}^2 - r_{c(p.s)}^2 = \frac{r_{cs}^2 - r_{cs}^2 r_{ps}^2}{1 - r_{ps}^2} , $$

and this further simplifies to $r_{cs}^2$. That is, $R_{c.ps}^2 - r_{c(p.s)}^2$ is equal to
$r_{cs}^2$. As expected, the squared multiple correlation yields the maximum
value and this value is simply a function of the degree to
which ideal conditions are not met, that is, the degree to which
$r_{cs}$ is not equal to zero. This suggests that even if part correlation
is the best theoretical formulation of suppression, it unfortunately 
will not invariably yield the maximum relationship; when ideal 
suppressor conditions do not exist, regular multiple regression will 
be better. If the goal is strictly empirical prediction, with the 
emphasis upon maximum validity, there is no advantage in using 
part or partial correlation rather than the standard multiple regres- 


2 2 
Tepe Тер.) = 
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sion formula. If suppression effects are present, they will be revealed 
as a result of this analysis. But it is important to be clear as to one’s 
intent. Empirical prediction implies that one seeks repeatedly to 
use a battery of tests on samples of subjects to make decisions to 
optimize the utility of certain outcomes. In empirical prediction the
primary goal is not the understanding of relationships; it is a
strictly utilitarian one. Although this is the context in which a great
deal of test theory is promulgated, it is important to recognize that 
some authors (Loevinger, 1957) have argued that the assumptions 
underlying this model (e.g., an invariant criterion) are rarely pre- 
cisely met. In any case, it is important to differentiate the goals 
implicit in empirical prediction from those of understanding the 
nature of psychological processes. If theoretical interpretation of 
psychological relationships is the primary goal, a different rationale 
is appropriate, one embodying that underlying the part correlation 
rather than the suppressor. 
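
The relation between the squared part correlation and the squared multiple correlation stated in this section can be verified with a few lines of arithmetic. This is only an illustrative sketch; the function names and the sample correlations are assumptions chosen for the demonstration, not values from any study cited here.

```python
import math


def squared_part(r_cp, r_cs, r_ps):
    """Squared part correlation of c with p, s removed from p."""
    return (r_cp - r_cs * r_ps) ** 2 / (1.0 - r_ps ** 2)


def squared_multiple(r_cp, r_cs, r_ps):
    """Squared multiple correlation of c with p and s."""
    num = r_cp ** 2 + r_cs ** 2 - 2.0 * r_cp * r_cs * r_ps
    return num / (1.0 - r_ps ** 2)


def shortfall(r_cp, r_cs, r_ps):
    """Squared multiple correlation minus squared part correlation."""
    return squared_multiple(r_cp, r_cs, r_ps) - squared_part(r_cp, r_cs, r_ps)


if __name__ == "__main__":
    # Hypothetical non-ideal case: the suppressor correlates .20 with the criterion.
    r_cp, r_cs, r_ps = 0.40, 0.20, 0.60
    print(shortfall(r_cp, r_cs, r_ps), r_cs ** 2)
```

The two printed numbers agree up to rounding error, so the validity forgone by using the part correlation depends only on the suppressor-criterion correlation.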


Partial Regression and the Nature of Psychological Relationships 


The term, suppressor variable, has probably been associated
with the development of the K scale of the MMPI (Meehl and
Hathaway, 1946) more frequently than with any other single
scale. The rationale for the development of the K scale was
essentially based upon the empirical prediction model. The goal
was to identify within a deviant psychiatric population a set
of items differentiating those individuals with (presumably valid)
elevated clinical scale scores from those with apparently normal
(presumably invalid) scale scores. Subsequently, this set of items
was to be used to suppress the reliable but apparently invalid
variance in the clinical scale scores with the expectancy of more
valid assessment. The rationale of the K scale has an appeal
when considered in the context of large sample prediction of a
particular stable criterion. Unfortunately, it is inappropriate
when applied to the vast majority of uses to which the MMPI
is put. The resulting use of the K scale as a “correction,”
while possibly defensible on other grounds, is an irrelevant
application of the suppressor rationale, yielding at times some rather
illogical results. For example, pathological behavior can be ascribed
to a respondent because, and only because, he has received an
elevated K-scale score.

There is, furthermore, the problem of the instability of
regression weights derived from a particular sample. The criterion of
the concurrent assignment of psychiatric diagnoses at the
University of Minnesota hospitals was susceptible to a number of
local conditions affecting validity, such as the type of population 
attracted, local administrative procedures and biases affecting 
antecedent probability and base rates (Meehl and Rosen, 1955). 
In spite of the widespread use of the MMPI, its application in 
large sample prediction of this type has been rare, and we are 
aware of no study seeking to cross-validate the use of the K 
scale as a suppressor in the classical sense. But the suppressor 
rationale is based on the prediction of a particular criterion; its 
generality is an empirical matter and cannot be assumed. Never- 
theless, the MMPI typically has been used as a means by which 
characteristics or latent attributes are assigned to individuals on 
the basis of their test scores (Jackson and Messick, 1958, 1962). 
Under these circumstances, what is important is a reliable, un- 
biased estimate of a respondent’s location on the latent dimension, 
based on measures which are free, insofar as possible, from 
sources of substantive and methodological irrelevance. Therefore, 
it would be justifiable to remove an unwanted source of variance 
even if this resulted in a decreased validity with a particular em- 
pirical criterion. 

The important consideration here is that the test score should 
reflect a particular dimension possessing construct validity rather 
than merely empirical validity. On these grounds, the part cor- 
relation rationale seems preferable to the multiple regression 
rationale. The focus would thus shift from maximizing validity to 
minimizing sources of bias. There can be no objection to the part 
correlation approach on the grounds that regression weights 
cannot be obtained, since the partialling procedure would be 
done in such a way as to remove maximally the unwanted 
variance. This could be done by the equation originally
suggested by Horst (1941) and by Lubin (1957) or Meehl (1945)
with reference to suppressor variables, namely to form a new
variable by removing the unwanted variance of the suppressor:
$z_{p'} = z_p - r_{ps} z_s$. In a second step one could correlate the variable
$p'$ with a second variable with the intent of understanding the nature
and magnitude of the relationship between two variables, one with a
correction for irrelevance. Alternatively, one might wish simply to
interpret the corrected score as reflecting a certain degree of an
attribute, which forms the basis for some further analytical
treatment or assessment decision. The part correlation thus provides a
basis similar to multiple regression for the suppression of irrelevance
even if the increment in validity is not great. For example, if
presenting a good impression of oneself is irrelevant or logically distinct
from a certain personality trait, then this influence should be
removed from any measure of the trait, so that people who manage
impressions do not receive biased scores.

Perhaps because of the association of the suppressor with the 
K scale of the MMPI, or perhaps for other theoretical reasons, 
a number of attempts have been made (Fricke, 1956; Fulkerson, 
1958) to remove the influence of acquiescence by the use of the 
suppressor variable format. These early attempts failed to find
a large improvement in prediction. Dicken (1963) undertook an
investigation in which he expected good impression, social
desirability and acquiescence to act as suppressors for the California
Psychological Inventory. Dicken summarized his results as follows:

    Suppression of desirability resulted in significant predictive
    gain in only 4 of 24 comparisons in the high school data. In
    the non-high school data, only 2 of 36 comparisons show a
    significant suppression effect, a result attributable to chance.
    There is no instance in the grand total of 50 comparisons of
    a large gain in validity by suppressing desirability. . . .
    The expectation that correcting personality scores for
    individual differences in desirability responding will increase
    validity is not fulfilled. There were no instances of significant gain
    in validity by suppression of acquiescence variance. (p. 712).

Dicken’s results thus indicate that these stylistic scales do not
raise the empirical validity of the standard CPI scales; however,
it should be pointed out that Dicken was dealing with low
validities, i.e., validities of the order of .30, which, as will be shown,
markedly constrains the possibility of suppressor effects. More
recently, Goldberg, Rorer, and Greene (1969; Greene, 1967)
investigated the usefulness of stylistic scales as potential suppressors
or moderator variables in predictions from the CPI. Their stylistic
scales for the most part satisfied suppressant criteria; they were
highly correlated with CPI scales and virtually uncorrelated with
their various criteria. They used 13 potential suppressors and
moderators for 13 criteria-predictor pairs. In 30 of the 169 com- 
binations there was a reasonable expectation of suppression. Under 
cross-validation half of them yielded a lower value than the zero- 
order validity, something not entirely unexpected in view of Lubin’s 
(1957) analysis of the suppressor situation. Of all of their relations, 
only one showed any substantial incremental validity over the 
predictor alone. 

It would be instructive to examine what reasonably might have 
been expected, given the magnitude of predictor-criterion cor- 
relations, as they typically occur, e.g. in the Goldberg, Rorer, and 
Greene study. This study is selected merely as illustrative. Others 
might have served equally well. Table 1 presents a set of validity 
coefficients selected from the single CPI scale with the highest 
validity for predicting each of 13 criteria. If the suppressors
correlated zero with their criterion measures and possessed an
$r_{ps}$ of .50, the maximum theoretical value is given in the second
row of Table 1. (In practice, the suppressor-criterion correlation
will not usually be precisely zero, nor, given the short stylistic
scales used by these authors, will the reliability always be high
enough to warrant the assumption of an $r_{ps}$ = .50; these departures
from our assumptions would tend to lower the maximum theoretical 
suppressor effect.) In reviewing Table 1, note that the theoretical 
increments in validity through the use of a suppressor of the order 
of .02 through .08 are not very large; however, they are the best 
that can be expected under the stated conditions. The best suppressor 
effects obtained in this study selected from a much larger number 
are given in the third row of Table 1. Note that the only sub- 
stantial gain was for the CPA scale, a gain of .11, and that the 
departures from the theoretical maximum values ranged from 
.02 to .06. While substantial suppressor effects did not appear 
empirically, there were scant mathematical grounds for expecting 
any. The level of the reported validities did not permit such 
effects. 

The conclusion of Goldberg, et al. was substantive, however:
“Consequently, it now seems safe to conclude that stylistic
variables, per se, do not function as general suppressor variables.”
These authors concluded the same about moderator variables,
something that might be expected, given the intimate
relationship that does exist between a suppressor variable and a moderator


concluded that under the conditions of their study, validities 
showed no substantial decrease when unwanted stylistic variance 
and response biases were eliminated from their predictors. 


Limitations in Incremental Validity Through the Use of 
Suppressors 


It would seem that some systematic knowledge of the
mathematical limitations imposed upon incremental validity in the
use of suppressor variables is called for, given the
disappointing empirical results. Consider the ideal classical suppression
[...]
larger than the criterion-predictor correlation; probably, $r_{cp}$ and
$r_{ps}$ will each be nearer .40 than .75, but all possibilities should be
evaluated. Table 2 shows what can be expected in the way of
incremental validity for pairs of validity and suppressor relations. For
example, if a validity of .40 is obtained, a suppressor relation of .60
is needed to increase this by .10. Or taking Goldberg, Rorer and
Greene's (1969) maximum validity of .51, a suppressor-predictor
correlation of .50 is required to increase the overall relationship to
around .59. The lowest validity reported by Goldberg, et al. was .13;
in order to increase this by .07 one needs a
correlation of over .80, a value which might well exceed the
reliability of the predictor. Removing the suppressant would be
equivalent to removing all of the valid variation. If one considers
.40 as a validity typical of many psychological tests, the largest
increase that could be expected would be .267. This would increase
the original validity to .667, which would be very impressive;
however, the suppressor-predictor correlation would have to be .80 for
this to be achieved. To make things worse, improvement is not
linearly related to the suppressor-predictor correlation. The
relationship is such that the increment in improvement is less for lower
correlations than for larger ones.
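
The figures quoted in the preceding paragraph follow from the ideal-suppressor formula, in which the validity with the suppressor is the zero-order validity divided by the square root of one minus the squared suppressor-predictor correlation. The sketch below simply evaluates that formula; the function names are illustrative assumptions, and the inverse function is obtained by solving the same formula for the suppressor-predictor correlation.

```python
import math


def suppressor_increment(r_cp, r_ps):
    """Gain in validity from an ideal suppressor (suppressor-criterion r = 0)."""
    return r_cp * ((1.0 - r_ps ** 2) ** -0.5 - 1.0)


def required_r_ps(r_cp, delta):
    """Suppressor-predictor correlation needed to raise validity r_cp by delta."""
    ratio = r_cp / (r_cp + delta)
    return math.sqrt(1.0 - ratio ** 2)


if __name__ == "__main__":
    # Cases discussed in the text.
    print(round(suppressor_increment(0.40, 0.60), 3))  # validity .40, suppressor relation .60
    print(round(suppressor_increment(0.40, 0.80), 3))  # validity .40, suppressor relation .80
    print(round(0.51 + suppressor_increment(0.51, 0.50), 2))  # validity .51, relation .50
    print(round(required_r_ps(0.40, 0.10), 2))  # relation needed for a gain of .10
```

Evaluating the function over a grid of both correlations gives the kind of tabulation that Table 2 is described as presenting.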

In Figure 2 curves are shown in which $\delta$, the increment in
prediction, is expressed as a function of the suppressor-predictor
correlation for a fixed criterion-predictor correlation. These show
that maximal increases can be expected for large validities and
the expected gain becomes near zero for small validities, unless
the suppressant effect is very large. The total picture is given in
Figure 3. It clearly shows that the change in prediction is linearly
related to the criterion-predictor correlation for a fixed
suppressor-predictor correlation ($r_{ps}$), whereas there is a curvilinear
relationship when the criterion-predictor correlation ($r_{cp}$) is fixed and
the remaining two components are free to vary. Thus, there is
likely to be a discrepancy between subjective expectations and
findings. An investigator who expects a linear improvement for
suppressor effects is likely to be disappointed because these
effects are curvilinear; they are smallest for lower and more
frequently-encountered values of $r_{cp}$, while showing an
accelerating increment only for higher and more rarely-encountered
zero-order validities.

Table 2. Increments in validity, $\delta$, for pairs of validity ($r_{cp}$, columns)
and suppressor ($r_{ps}$, rows) relations

r_ps    .10  .20  .30  .40  .50  .60  .71  .80  .87  .92  .95  .98  .995
.10     000  001  002  002  003  003  004  004  004  005  005  005  005
.20     002  004  006  008  010  012  014  016  017  018  019  020
.30     005  010  015  021  026  031  037  042  045  048  050
.40     009  017  026  035  043  052  062  070  076  080
.50     015  030  045  060  075  090  107  120  130
.60     025  050  075  100  125  150  175  200
.71     041  082  122  163  204  245  290
.80     067  133  200  267  333  400
.87     100  200  300  400  500
.92     150  300  450  600
.95     233  467  700
.98     400  800
.995    900

Note. Decimal points are omitted from the increments, which are expressed in thousandths.

Figure 2. Change in prediction, $\delta$, due to a suppressor variable shown as a
function of $r_{ps}$ for given values of $r_{cp}$.


Is It More Useful to Seek Suppressors or to Seek New Predictors?


Which strategy will generally yield the best payoff in terms 
of incremental validity, adding a second predictor or adding a 
suppressor? Will a predictor with a given level of validity add 
more than a suppressor with the same level of correlation with 
the first predictor? 

Figure 3. Three dimensional representation of the change in prediction, $\delta$,
due to a suppressor variable.

Denoting the second predictor by q and letting $r_{cq}$ be greater than
zero we have a maximum independent contribution from q when
$r_{pq} = 0$. Of course, if negative suppressors are allowed, q could
contribute more if $r_{pq}$ were less than zero; however, the present concern
is with a simple predictor rather than the more complex (and less
likely) negative suppressor. The increment in validity due to the
additional predictor is $R_{c.pq} - r_{cp}$, where $R_{c.pq}$ is the multiple
correlation for predicting c from both p and q, and it is given by the
equation

$$R_{c.pq} = \sqrt{\frac{r_{cp}^2 + r_{cq}^2 - 2 r_{cp} r_{cq} r_{pq}}{1 - r_{pq}^2}}.$$

Under the stated conditions that $r_{pq}$ is zero, this reduces to

$$R_{c.pq} = \sqrt{r_{cp}^2 + r_{cq}^2}.$$

The increment $\delta_q$, which represents the improvement in prediction
of multiple correlation over zero order correlation, is therefore

$$\delta_q = R_{c.pq} - r_{cp} = \sqrt{r_{cp}^2 + r_{cq}^2} - r_{cp}.$$


The increment for the suppressor given above is

$$\delta_s = r_{cp}\left[(1 - r_{ps}^2)^{-1/2} - 1\right].$$

If the increments are to be equal, then $\delta_q = \delta_s$ and

$$r_{cp}(1 - r_{ps}^2)^{-1/2} = \sqrt{r_{cp}^2 + r_{cq}^2}.$$

Simple algebra reduces this to $\delta_q = \delta_s$ if and only if

$$r_{ps}^2 = \frac{r_{cq}^2}{r_{cp}^2 + r_{cq}^2} = \frac{r_{cq}^2}{R_{c.pq}^2}.$$

This shows that $r_{ps}$ must be greater than $r_{cq}$ in order to obtain the
same increase in validity. For example, let $r_{cp}$ = .40, then

$$\delta_s = .40\left[(1 - r_{ps}^2)^{-1/2} - 1\right]$$

and

$$\delta_q = \sqrt{.16 + r_{cq}^2} - .40.$$

If a second predictor is found such that $r_{pq} = 0$ and $r_{cq} = .30$, then
$\delta_q$ will be equal to .10, that is, there will be an increase in validity
of .10. In order to get an increment of .10 with a suppressor, we
solve the $\delta_s$ equation for $r_{ps}^2$ and find that $r_{ps}$ must be equal to .60.
In this case, the suppressor-predictor correlation must be twice as
large as the second predictor-criterion correlation. In terms of
variance accounted for, the suppressor must account for four times as
much variance in the predictor as the second predictor must account
for in the criterion. Obviously, efforts would be better spent looking
for correlations of the same size with the criterion rather than with
the predictor. A suppressor for any given degree of correlation does
not yield as much incremental validity as an additional predictor.
Therefore, more can be gained by finding that part of the criterion
not being predicted rather than that part of the predictor not being
used.
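
The comparison between a second predictor and a suppressor can be reproduced directly from the two increment formulas of this section. The sketch below is illustrative only; the function names are assumptions, and the second predictor is taken to be uncorrelated with the first, as in the text.

```python
import math


def predictor_increment(r_cp, r_cq):
    """Gain from a second predictor q that is uncorrelated with p."""
    return math.sqrt(r_cp ** 2 + r_cq ** 2) - r_cp


def suppressor_increment(r_cp, r_ps):
    """Gain from an ideal suppressor s (uncorrelated with the criterion)."""
    return r_cp * ((1.0 - r_ps ** 2) ** -0.5 - 1.0)


def equivalent_r_ps(r_cp, r_cq):
    """Suppressor-predictor correlation giving the same gain as predictor q."""
    return r_cq / math.sqrt(r_cp ** 2 + r_cq ** 2)


if __name__ == "__main__":
    r_cp, r_cq = 0.40, 0.30
    r_ps = equivalent_r_ps(r_cp, r_cq)
    print(round(predictor_increment(r_cp, r_cq), 3))
    print(round(r_ps, 3))
    print(round(suppressor_increment(r_cp, r_ps), 3))
```

For any validity below unity the suppressor-predictor correlation returned by `equivalent_r_ps` exceeds the second predictor's validity, which is the point of the argument.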

In the light of all of these considerations, should the uncritical 
search for the suppressor variable be suppressed? 


Conclusions 


1. The use of a suppressor rationale in prediction may be justi- 
fied under certain conditions where it can be demonstrated 
that it is possible to account for a reliable proportion of the 
predictor variance after cross-validation in terms of a vari- 
able not associated with the criterion. This situation is rarely 
encountered in practice. 

2. Under ideal conditions for the operation of suppressor effects,
i.e., where the suppressor-criterion correlation is zero, it is
shown that the suppressor approach will yield results
mathematically identical with those obtained from the part and
partial correlation.

3. If maximum prediction is the primary goal, the use of the
suppressor approach is appropriate if, and only if, the
suppressor-criterion correlation is zero. Where the value of the
suppressor-criterion correlation departs from zero, it is shown
that the use of the multiple correlation will always yield a
higher validity than will the use of the pure suppressor
approach. It should be recognized, in any case, that the
contribution of a suppressor-predictor to a multiple
correlation will be unstable where both the magnitude of the
relationship and the sample size are modest.

4. A theoretical limit on the suppressor is operative such that
the upper bound of incremental validity over the predictor
alone is curvilinearly related to the magnitude of the predictor-
criterion validity, with the increment smaller for lower initial
validities than higher ones. The curvilinearity may be
important in that the researcher’s subjective expected
improvement is likely to be linear; however, the objective improvement
is not.

5. It is shown that attempting to isolate the part of the predictor
not relevant to the criterion is ordinarily less efficient than
predicting that part of the criterion not being predicted. New
predictors will be easier to find than will effective suppressors.

6. A valid distinction may be drawn between the use of the
suppressor approach in the context of prediction and in the
context of the construct interpretation of psychological
measures and relationships. Whereas the suppressor can be
justified in prediction only if it substantially improves prediction,
the use of the part or partial correlation is justified even if the
resulting validity remains unchanged or is reduced, provided
its application removes a conceptually irrelevant portion of
the variance. This latter approach is preferable in studies
focusing upon the interpretation of psychological relationships.


FACTOR SCORES INDEPENDENT OF ITEM TRAITS¹


PAUL HORST 
University of Washington 


Much attention has been given to the influence of those
characteristics of stimulus elements (items, for example) on responses
of individuals, aside from the actual purported item content.
These characteristics may be social desirability, acquiescence,
polarity of statement (do, do not), preference value, serial
position, attributes of response categories (primacy, recency),
etc. The literature in this area is too extensive to make even
illustrative references feasible here. Perhaps now well structured
procedures should be developed for the management of extraneous
item characteristics in the evaluation of personality variables.
Such an enterprise should probably proceed as follows:

1. Exhaustive identification of those stimulus element char- 
acteristics which are independent of item content and to which 
persons may respond differentially. 

2. The definition of the elements of an item characteristic
(IC) vector such that, when a person's response vector is multiplied
by the IC vector, one may rationally regard such a product as the
person’s score for that item characteristic. For example, in the
case of dichotomous items, an item preference vector would
consist of the preference or “p” values of the items. The general
approach might be as follows: A response matrix for a sample
of entities is defined with rows as entities. An IC matrix
conformable to the response matrix on the right has a column
corresponding to each defined item characteristic. The IC matrix is
defined so that the product of the response matrix by the IC matrix
yields a matrix that may logically be regarded as a matrix of
IC scores. Definition of the response matrix will depend on the
structure of response option patterns and on keying procedures.
In any case, it is probably most appropriate to investigate
response styles or other extraneous item characteristics only after
having operationally and computationally defined the
corresponding item characteristic scores.

3. Having defined item characteristic scores, one may then
develop models for utilizing these together with item scores in
various ways. For example: One may partial out the IC variables
from the item variables. One may include them with item or
scale scores in factor analyses. One may utilize them with or
without item or scale scores in estimating criterion measures.
In general, one is free to utilize the IC variables in any way
that other variables can be used in multivariate analysis models.
The crucial consideration is that an entity response vector and
an item characteristic vector each be defined in such a way that
their minor product constitutes an acceptable definition of an
item characteristic score.

This general problem of achieving from dichotomous items
factor scores that are independent of particular response sets or
item characteristic scores was thrust upon the attention of the
writer by Dorothy C. Adkins and Bonnie L. Ballif in connection
with a project of the Center for Research in Early Childhood
Education at the University of Hawaii. In their early factor
analyses of items in their Gumpgookies, a test of motivation to
achieve in school for young children, attempts to interpret
factors were plagued by an apparent contamination of the factors
by two types of response sets affecting the dichotomous items:
position of the keyed answer to an item (right or left) and the
order in which the keyed answer to an item was presented to the
subject (first or second). The methods described herein are an
outgrowth of their desire for factors based upon an item
intercorrelation matrix with response set scores partialled out. They
have been applying the method with every indication of success
(Adkins and Ballif, 1972).

¹ The research reported herein was performed pursuant to a grant (Number
CG9929) of the United States Office of Economic Opportunity to the Center
for Research in Early Childhood Education, a branch of the Education
Research and Development Center, University of Hawaii. Contractors
undertaking such projects under Government sponsorship are encouraged to express
freely their professional judgment on the conduct of the project. Points of
view or opinions stated do not, therefore, necessarily represent official position
or policy of the Office of Economic Opportunity. The study was also
supported in part by Public Health Research Grant 2 R01 MH00743-15 to the
University of Washington, Seattle, Washington.


The Rationale 


Let
    $n_e$ be the number of entities
    $n_a$ be the number of items
    $n_t$ be the number of item traits
    $n_f$ be the number of factors
    $X_{ea}$ be an ($n_e \times n_a$) binary item score matrix
    $X_{at}$ be an ($n_a \times n_t$) item trait matrix
    $M_a$ be an item mean vector
    $D_a$ be an item standard deviation matrix
    $x_{ea}$ be a deviation item score matrix
We define an ($n_e \times n_t$) trait score matrix by
$$X_{et} = X_{ea} X_{at}. \qquad (1)$$
From (1) we see that a trait score is defined as the scalar
product of an entity’s item score vector by an item trait vector.

We consider now the general problem of calculating item 
factor scores which are independent of item trait scores. To do 
this, we first consider the problem of calculating item scores which 
are independent of item trait scores. We let $x_{et}$ be an ($n_e \times n_t$)
deviation matrix of item trait scores. It can readily be shown from
the definitions and equation 1 that
$$x_{et} = x_{ea} X_{at}. \qquad (2)$$
Let
    $C_{aa}$ be an $n_a$'th order matrix of covariances of item scores;
    $C_{at}$ be an ($n_a \times n_t$) matrix of covariances of item scores with
    item trait scores;
    $C_{tt}$ be an ($n_t \times n_t$) matrix of trait score covariances; and
    $u_{ea}$ be an ($n_e \times n_a$) matrix of item scores independent of item
    trait scores.

Reversing subscripts to indicate transposition, we have from the
definitions
$$C_{aa} = x_{ae} x_{ea} / n_e, \qquad (3)$$
$$C_{at} = x_{ae} x_{et} / n_e, \qquad (4)$$
$$C_{tt} = x_{te} x_{et} / n_e. \qquad (5)$$
We now write
$$u_{ea} = x_{ea} - x_{et} B_{ta}, \qquad (6)$$
where $B_{ta}$ is a matrix determined so that $u$ is independent of the
matrix of item trait scores $x_{et}$, or such that
$$x_{te} u_{ea} = 0. \qquad (7)$$
From (4) through (7)
$$0 = C_{ta} - C_{tt} B_{ta}. \qquad (8)$$
And from (8)
$$B_{ta} = C_{tt}^{-1} C_{ta}. \qquad (9)$$
Equation (9) is the well known expression for the matrix of
multiple regression constants calculated from deviation measures.
From (2) and (6),
$$u_{ea} = x_{ea}(I - X_{at} B_{ta}). \qquad (10)$$
Let
$$G_{aa} = u_{ae} u_{ea} / n_e, \qquad (11)$$
where $G_{aa}$ is now the covariance matrix of item scores independent
of the item trait scores.

From (3), (4), (5), (6), (9), and (11) we have
$$G_{aa} = C_{aa} - C_{at} C_{tt}^{-1} C_{ta}. \qquad (12)$$
Also from (2), (3), and (4),
$$C_{at} = C_{aa} X_{at}, \qquad (13)$$
and from (9) and (12)
$$G_{aa} = C_{aa} - C_{at} B_{ta}. \qquad (14)$$
The matrix $G_{aa}$ in (14) is a covariance matrix. To get factor
scores independent of item trait scores, we consider first the
correlation matrix derived from $G_{aa}$. We let
$$D_g = \operatorname{diag}(G_{aa}). \qquad (15)$$
Then the partial correlation matrix of item scores with item
trait scores partialed out is
$$R_{aa} = D_g^{-1/2} G_{aa} D_g^{-1/2}. \qquad (16)$$
We next consider a factor analysis of $R_{aa}$. First we consider the
basic structure or principal axes solution for $n_f$ factors. We let
$A_{af}$ be the first $n_f$ principal axes factors of $R_{aa}$. The method utilized
will be the basic structure successive factor method (Horst,
1965, p. 160).
We let
$$A_{af} = Q_{af} \Delta_f, \qquad (17)$$
where $\Delta_f^2$ is a diagonal matrix of the $n_f$ largest eigenvalues of $R_{aa}$,
and $Q_{af}$ is a matrix of the corresponding eigenvectors.

Next we solve for the varimax transformation of the $A_{af}$ matrix
by the simultaneous factor varimax solution (Horst, 1965, p.
428). This solution also incorporates the algorithms used in the
principal axes solutions, since the method requires a series of
basic structure type solutions. We indicate this solution by
$$B = A_{af} H, \qquad (18)$$
where $H$ is the square orthonormal transformation which
satisfies the varimax criterion.
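
The partialling steps leading to the partial correlation matrix, equations (9) through (16), can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the author's program: the function name, the simulated binary responses, and the two hypothetical item traits are all invented for the demonstration, and the trait covariance matrix is formed as the item trait matrix transposed times the item-by-trait covariance matrix.

```python
import numpy as np


def partial_item_correlations(X_ea, X_at):
    """Item covariances and correlations with item trait scores partialled out."""
    n_e = X_ea.shape[0]
    x_ea = X_ea - X_ea.mean(axis=0)        # deviation item scores
    C_aa = x_ea.T @ x_ea / n_e             # item covariances, eq. (3)
    C_at = C_aa @ X_at                     # item-by-trait covariances, eq. (13)
    C_tt = X_at.T @ C_at                   # trait covariances
    B_ta = np.linalg.solve(C_tt, C_at.T)   # regression constants, eq. (9)
    G_aa = C_aa - C_at @ B_ta              # residual covariances, eq. (14)
    d = np.sqrt(np.diag(G_aa))             # residual standard deviations
    R_aa = G_aa / np.outer(d, d)           # partial correlations, eq. (16)
    return G_aa, R_aa, B_ta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated binary responses: 300 entities by 8 items (assumed data).
    X_ea = (rng.random((300, 8)) < 0.5).astype(float)
    # Two hypothetical item traits coded 0/1 for each item.
    X_at = np.array([[1, 1], [0, 1], [1, 0], [0, 0],
                     [1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
    G_aa, R_aa, B_ta = partial_item_correlations(X_ea, X_at)
    print(np.round(R_aa, 2))
    print(np.abs(X_at.T @ G_aa).max())     # near zero: residuals carry no trait covariance
```

The last printed value checks the independence condition of equation (7) in covariance form.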

It should be noted that the varimax criterion is independent of
the signs of the varimax factor loading matrix. For this reason
it may often be necessary to determine both row and column
sign changes for a varimax factor loading solution in order to
maximize the number of large positive elements. For some factor
analytic solutions of item variables, it may be that some items
were not keyed to give the optimal positive manifold. Suppose
then we let
$$b = i_a B i_f, \qquad (19)$$
where $i_a$ is a sign matrix operating on rows of $B$ and $i_f$ operates
on columns. As a first approximation we give $i_a$ the signs of the
corresponding elements in $A_{.1}$, the first principal axes factor
loading vector. This we may indicate by
$${}_1i_a = \operatorname{sign}(D_{A_{.1}}). \qquad (20)$$
We let
$${}_1b = {}_1i_a B. \qquad (21)$$
We then take as $i_f$ the signs of the column sums of ${}_1b$ in (21) and
write
$${}_2b = {}_1b\, i_f. \qquad (22)$$
Next from ${}_2b$ we make up a sign matrix ${}_2i_a$ from the signs of the
largest absolute values in the rows of ${}_2b$ of (22). We then let
$$i_a = {}_2i_a\, {}_1i_a. \qquad (23)$$
Hence for the $b$ matrix of (19) the largest element of each row
will be positive. In general, it can also be expected that the
largest element of each column will be positive.

It should be observed that the problem of determining the
optimal sign-changing matrices $i_a$ and $i_f$ for varimax factor
loading matrices has been largely ignored by most investigators,
and that this problem is not restricted to the case where factor
scores independent of item trait scores are sought.
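
The sign-changing steps of equations (19) through (23) may be sketched as below. The sketch rests on assumptions: sign matrices are stored as vectors of +1 and -1 rather than as diagonal matrices, zero elements are arbitrarily given a positive sign, the final row sign matrix is taken as the product of the two successive row sign vectors, and the loading matrix in the demonstration is random rather than a varimax solution.

```python
import numpy as np


def _signs(values):
    """Vector of +1/-1 signs; zeros are treated as positive (an assumption)."""
    s = np.sign(np.asarray(values, dtype=float))
    s[s == 0] = 1.0
    return s


def reflect_signs(B, A_first):
    """Row and column sign changes making the largest element of each row positive."""
    i_a1 = _signs(A_first)                       # signs of first principal axis, eq. (20)
    b1 = i_a1[:, None] * B                       # eq. (21)
    i_f = _signs(b1.sum(axis=0))                 # signs of the column sums
    b2 = b1 * i_f[None, :]                       # eq. (22)
    rows = np.arange(B.shape[0])
    largest = b2[rows, np.abs(b2).argmax(axis=1)]
    i_a2 = _signs(largest)                       # signs of largest row elements
    i_a = i_a2 * i_a1                            # combined row signs, eq. (23) as assumed
    b = i_a[:, None] * B * i_f[None, :]          # eq. (19)
    return b, i_a, i_f


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.normal(size=(6, 3))                  # stand-in for a varimax loading matrix
    b, i_a, i_f = reflect_signs(B, B[:, 0])      # first column used as a stand-in for A_.1
    print(np.round(b, 2))
    print(i_a, i_f)
```

Only signs are altered, so the absolute values of the loadings, and hence the varimax criterion, are unchanged.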
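To define the trait-free factor score matrix, we return to
equation (10). Here $u_{ea}$ is an item score matrix independent of the
item trait scores, as indicated by equation (7). But the scores in
this matrix are not standardized. Therefore we let
$$v_{ea} = u_{ea} D_g^{-1/2}. \qquad (24)$$
From (11), (15), and (16), we see that
$$v_{ae} v_{ea} / n_e = R_{aa}, \qquad (25)$$
and therefore $v_{ea}$ is a matrix of standard measures.

We seek the factor score matrix $y_{ef}$
which enables us best to approximate $v_{ea}$ in the least squares
sense. The appropriate model for this solution may be written
$$v_{ea} i_a - y_{ef} b' = \varepsilon, \qquad (26)$$
where of course $v_{ea}$ and $b$ are known and the $i_a$ appears on the left
of (26) to conform to $i_a$ in (19).
The least squares solution for $y_{ef}$ in (26) is well known to be
$$y_{ef} = v_{ea} i_a b (b'b)^{-1}. \qquad (27)$$
From (17), (18), (19), and (27), we get
$$y_{ef} = v_{ea} Q_{af} \Delta_f^{-1} H i_f. \qquad (28)$$
But from (17) and (28) we may write
$$y_{ef} = v_{ea} A_{af} \Delta_f^{-2} H i_f. \qquad (29)$$
If we let
$$c = \Delta_f^{-2} H i_f \qquad (30)$$
and
$$\beta_{af} = A_{af} c, \qquad (31)$$
from (29) and (31),
$$y_{ef} = v_{ea} \beta_{af}. \qquad (32)$$
From (24) and (32),
$$y_{ef} = u_{ea} D_g^{-1/2} \beta_{af}. \qquad (33)$$
Let
$${}_1\beta_{af} = D_g^{-1/2} \beta_{af}. \qquad (34)$$
From (33), (34), and (10),
$$y_{ef} = x_{ea}(I - X_{at} B_{ta})\, {}_1\beta_{af}. \qquad (35)$$
Let
$${}_2\beta_{tf} = B_{ta}\, {}_1\beta_{af}. \qquad (36)$$
From (35) and (36),
$$y_{ef} = x_{ea}({}_1\beta_{af} - X_{at}\, {}_2\beta_{tf}). \qquad (37)$$
Finally, let
$$B_{af} = {}_1\beta_{af} - X_{at}\, {}_2\beta_{tf}. \qquad (38)$$
From (37) and (38)
$$y_{ef} = x_{ea} B_{af}. \qquad (39)$$
Now by definition
$$x_{ea} = X_{ea} - 1 M_a'. \qquad (40)$$
From (39) and (40)
$$y_{ef} = X_{ea} B_{af} - 1 M_a' B_{af}. \qquad (41)$$
It can be proved now that
$$y_{fe} y_{ef} / n_e = I \qquad (42)$$
and
$$y_{fe} x_{et} = 0, \qquad (43)$$
or that the factor scores $y$ are uncorrelated and of unit variances,
and also uncorrelated with the item trait scores.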
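
The two properties claimed in (42) and (43) can be illustrated numerically. The sketch below is a simplified stand-in for the procedure, under loudly stated assumptions: the varimax transformation and the column sign matrix are both replaced by identity matrices, so the scores are unrotated principal axes scores, and the data and item traits are simulated. The function name is illustrative.

```python
import numpy as np


def trait_free_factor_scores(X_ea, X_at, n_f):
    """Principal axes factor scores with item trait scores partialled out.

    Assumption: no varimax rotation or sign changes are applied (H = i_f = I).
    """
    n_e, n_a = X_ea.shape
    x_ea = X_ea - X_ea.mean(axis=0)                  # deviation scores
    C_aa = x_ea.T @ x_ea / n_e
    C_at = C_aa @ X_at
    B_ta = np.linalg.solve(X_at.T @ C_at, C_at.T)    # eq. (9)
    u_ea = x_ea @ (np.eye(n_a) - X_at @ B_ta)        # eq. (10)
    d = np.sqrt(np.diag(u_ea.T @ u_ea / n_e))
    v_ea = u_ea / d                                  # eq. (24)
    R_aa = v_ea.T @ v_ea / n_e                       # eq. (25)
    eigval, eigvec = np.linalg.eigh(R_aa)
    order = np.argsort(eigval)[::-1][:n_f]           # n_f largest eigenvalues
    Q, delta_sq = eigvec[:, order], eigval[order]
    y_ef = v_ea @ Q / np.sqrt(delta_sq)              # eq. (28) with H = i_f = I
    x_et = x_ea @ X_at                               # deviation trait scores, eq. (2)
    return y_ef, x_et


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_ea = (rng.random((300, 8)) < 0.5).astype(float)   # simulated responses
    X_at = np.array([[1, 1], [0, 1], [1, 0], [0, 0],
                     [1, 1], [0, 1], [1, 0], [0, 0]], dtype=float)
    y_ef, x_et = trait_free_factor_scores(X_ea, X_at, n_f=2)
    print(np.round(y_ef.T @ y_ef / 300, 6))   # close to the identity matrix
    print(np.abs(y_ef.T @ x_et).max())        # close to zero
```

Because an orthonormal transformation and sign changes leave both properties intact, the same checks would apply to rotated scores.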


608 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Suppose we wish to transform the factor scores so that they 
have means of a, and standard deviations of se. We may write 


Y, = из, + 1 Vae. (44) 
Let 
Bay = В, 8, (45) 
апа 
У, = l'a, — M/B,;. (46) 
From (43) through (46), 
Ya, = Х.В, + 17. (47) 


Suppose now we wish to construct an integer scoring matrix E_af
from B_af such that the largest absolute value in each column of E_af
is c. We let D be a diagonal matrix of the largest absolute values
in the columns of B_af, calculate


B_af = (B_af D^(-1))(c + .999), (48)
and take the integer function of B_af; thus


E_af = int (B_af). (49)
The maximum absolute value in each column of E_af will now be c.
To further simplify the scoring, we may for any integer c greater
than 1 let E_af = sign E_af for E_af ≠ 0. With this procedure, the number
of items discarded because of all 0 elements in a row of E_af
decreases as c increases.
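
The integer scoring rule of equations (48) and (49) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weight matrix `B` below is made up, the function name `integer_weights` is hypothetical, and "int" is taken to mean truncation toward zero.

```python
import numpy as np

def integer_weights(B, c):
    """Integer scoring weights whose largest absolute value per column is c.

    Assumes the 'int' of equation (49) truncates toward zero.
    """
    B = np.asarray(B, dtype=float)
    D = np.abs(B).max(axis=0)            # largest absolute value in each column
    scaled = (B / D) * (c + 0.999)       # equation (48): rescale each column
    return np.trunc(scaled).astype(int)  # equation (49): integer part

# Hypothetical exact weights for four items on two factors.
B = np.array([[0.40, -0.10],
              [0.20,  0.30],
              [-0.05, 0.60],
              [0.10, -0.02]])

E = integer_weights(B, c=3)
E1 = integer_weights(B, c=1)

# Items whose rows are all zero receive no weight on any factor.
discarded_c3 = int((E == 0).all(axis=1).sum())
discarded_c1 = int((E1 == 0).all(axis=1).sum())
```

In this made-up example fewer items are left with all-zero rows at c = 3 than at c = 1, which is the behaviour the text describes.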

For the integer scoring method for dichotomous items, one 
may reverse the keying of all items with negative scoring 
weights so that all weights become positive. 

The interpretation of the negative scoring weights is that the 
items with negative weights for a factor suppress unwanted vari- 


ance of other factors which contaminate the items with positive 
weights for the factor. 


Computational Procedure 


The computational procedure for the foregoing rationale will 
now be described. The symbol # is read “the expression on the 
left is replaced by the expression on the right.” Unless otherwise 
indicated, subscripts refer to order of matrices and reversal of
subscripts means transposition of the matrix. Exponents in
parentheses mean elemental exponentiation. 

Calculate as a minor product of type II vectors (Horst, 1963,
p. 143)


C_aa = X_ae X_ea, (50)
M_a = X_ae 1, (51)
S_a = X_ae^(2) 1. (52)


Calculate the item means (M_a) and item standard deviation
vectors:


M_a # M_a/n_e, (53)

S_a # S_a/n_e, (54)

S_a # (S_a − M_a^(2))^(1/2). (55)
Calculate the item covariance matrix:

C_aa # C_aa/n_e − M_a M_a'. (56)
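
Steps (50) through (56) build means, standard deviations, and the covariance matrix from raw sums and cross-products. A minimal NumPy sketch follows; the data are simulated 0/1 item scores and the function name `item_moments` is hypothetical, not part of the original procedure.

```python
import numpy as np

def item_moments(X):
    """Means, standard deviations and covariance matrix from raw sums."""
    X = np.asarray(X, dtype=float)
    n_e = X.shape[0]                    # number of entities (rows)
    ones = np.ones(n_e)
    C = X.T @ X                         # (50) raw cross-products
    M = X.T @ ones                      # (51) column sums
    S = (X ** 2).T @ ones               # (52) column sums of squares
    M = M / n_e                         # (53) item means
    S = S / n_e                         # (54) mean squares
    S = np.sqrt(S - M ** 2)             # (55) standard deviations, divisor n_e
    C = C / n_e - np.outer(M, M)        # (56) covariance matrix, divisor n_e
    return M, S, C

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)  # simulated item scores
M, S, C = item_moments(X)
```

Note that the divisor is the number of entities rather than that number minus one, following the equations as written.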


Calculate the item-by-trait covariance matrix as the major
product of type II vectors (Horst, 1963, p. 144):
ба DLE (57) 
Calculate the trait-by-trait covariance matrix as a major prod- 
uct of type II vectors (Horst, 1963, p. 144): 
Cu = ХС. (58) 
Calculate the inverse of a symmetric matrix (Horst, 1963, 
p. 461):
С, # Oa. (59)
Calculate the regression matrix for estimating the item score
matrix from the item trait scores:
B, = Саба. (60) 


Calculate the covariance matrix of item scores with trait
scores partialed out:


= } (61) 
Le Coa # Са — СаВа 


D_a = diag (C_aa). (62)
Calculate


D_a # D_a^(-1/2). (63)


Calculate the item correlation matrix with trait scores partialed
out:


C_aa # D_a C_aa D_a. (64)
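
The partialing steps that end with equation (64) amount to regressing the item scores on the trait scores and rescaling the residual covariance matrix to a correlation matrix. The sketch below illustrates that idea with simulated data; the names `X`, `T`, and `partial_item_correlations` are assumptions made for the example, and the trait scores here are an arbitrary stand-in rather than the item trait scores defined earlier in the paper.

```python
import numpy as np

def partial_item_correlations(X, T):
    """Item correlations with the columns of T partialed out."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=float)
    n_e = X.shape[0]
    Xc = X - X.mean(axis=0)
    Tc = T - T.mean(axis=0)
    C_aa = Xc.T @ Xc / n_e               # item covariance matrix
    C_ta = Tc.T @ Xc / n_e               # trait-by-item covariance matrix
    C_tt = Tc.T @ Tc / n_e               # trait-by-trait covariance matrix
    B = np.linalg.solve(C_tt, C_ta)      # regression weights, cf. (59)-(60)
    C_res = C_aa - C_ta.T @ B            # residual covariance, cf. (61)
    d = 1.0 / np.sqrt(np.diag(C_res))    # cf. (62)-(63)
    R = C_res * np.outer(d, d)           # partial correlations, cf. (64)
    return R, C_res, B

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                                      # simulated items
T = X[:, :2].sum(axis=1, keepdims=True) + rng.normal(size=(300, 1))  # stand-in trait
R, C_res, B = partial_item_correlations(X, T)
```

Solving the linear system instead of forming the inverse explicitly is a numerical convenience; it gives the same regression weights as steps (59) and (60).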

Calculate the first n; principal axes factor loading vectors of 
Cas (Horst, 1965, p. 160): 

А = QA. (65) 


Calculate the varimax transformation matrix H (Horst, 1965,
p. 428) iteratively as follows:


B = AH, (66)
D = diag (B'B), (67)
D # D/n_a, (68)
B # B^(3) − BD, (69)
C = A'B. (70)
The basic structure of C (Horst, 1965, p. 437) is
P Δ Q' = C (71)
and
H = PQ'. (72)
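
The iteration in (66) through (72) can be sketched with a singular value decomposition standing in for the "basic structure" of C. This is a hedged illustration, not the authors' program: the starting value H = I, the stopping rule, the function name `varimax_svd`, and the small loading matrix are all assumptions chosen for the example.

```python
import numpy as np

def varimax_svd(A, tol=1e-10, max_iter=500):
    """Varimax transformation found by repeating steps (66)-(72)."""
    A = np.asarray(A, dtype=float)
    n_a, n_f = A.shape
    H = np.eye(n_f)                          # assumed starting value
    for _ in range(max_iter):
        B = A @ H                            # (66)
        d = (B ** 2).sum(axis=0) / n_a       # (67)-(68): diag(B'B) / n_a
        G = B ** 3 - B * d                   # (69): B * d equals B @ diag(d)
        C = A.T @ G                          # (70)
        P, _, Qt = np.linalg.svd(C)          # (71): C = P diag(s) Q'
        H_new = P @ Qt                       # (72)
        converged = np.linalg.norm(H_new - H) < tol
        H = H_new
        if converged:
            break
    return A @ H, H

# Hypothetical simple-structure loadings, rotated 30 degrees away from it.
L0 = np.array([[0.9, 0.0], [0.6, 0.0], [0.3, 0.0],
               [0.0, 0.9], [0.0, 0.6], [0.0, 0.3]])
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), np.sin(theta)],
              [-np.sin(theta), np.cos(theta)]])
A = L0 @ R

B, H = varimax_svd(A)
```

For this constructed example the iteration returns to the simple-structure loadings it started from; with real data the rate of convergence depends on the loading pattern, so a cap on the number of iterations is a sensible safeguard.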


Continue (66) through (72) until H stabilizes, beginning with
H = I.


Calculate sign matrices for S, and S,, as indicated in equations
(19) through (23) (S, = ir, S; = ip).


Calculate the optimal sign varimax factor loading matrix


B # S.BS,. (73)
Calculate the varimax regression matrix as follows:
H # HS,, (74) 
H # ATH, (75) 
В = АН, (76) 


В# DB, (77)
Ви = Вав, (78)
В#В- Х.В,, (79)
В # BS, (80) 

V. = Va, — М.В. (81) 


Calculate E as an integer matrix from B as indicated in
equations (48) and (49). 

Calculate as minor products of type II vectors (Horst, 1963, 
p. 148) the exact factor scores by 


уз = Х.В — 172, (82) 
the integer weight factor scores by 
ys = Хаћ, (83) 
and the item trait scores by 
у, ХЫ (84) 


Letting 


у = (Ys, Yz у.), (85) 


calculate the correlations among the three sets of у scores in 


(85) as follows: 
Calculate as a minor product Type II vector: 


Caa = Y'Y, (86) 
M, = у1, (87) 
Sp = ул. (88) 


Calculate the means and standard deviations for the y supermatrix
by


М. # M./n, (89) 
So # Sp/n., (90) 
S» # (Sp — M,?)*^. (91) 
Calculate the correlation supermatrix for the y scores by 
« Caa # (Coa — М.М.)/(68), (92) 


where / means elemental division.
We may now define


Rss Roz Ев, 
C, = |Ёкв Rez Ree (93) 
1B Ruz Bis 
Then we should have 
Ros = І, (94) 
Ев, = 0, (95) 


and Egg and Rag should approximate identity matrices while 
Ra: should approximate а null matrix. 


Kuder-Richardson Factor Score Reliability 


We may estimate the Kuder-Richardson reliability for the 
Ув scores for the j'th factor by 


Mao mea DE? (S? — B/D/'Bj), (96) 


where B; is the j'th column vector of the exact varimax weighting 
matrix which yields factor scores with standard deviations of 
Se, and D, is a diagonal matrix of item standard deviations. This 
formula is based on the assumption that the average item retest
covariance is equal to the average inter-item covariance. Presumably,
therefore, it may markedly underestimate the retest reliability
of a factor score.


THE STRUCTURE AND CONTENT OF
SOCIAL ATTITUDE REFERENTS:
A PRELIMINARY STUDY


FRED N. KERLINGER 
New York University 


This study had four purposes: to explore the measurement of
social attitudes by using attitude referents (objects) as stimuli;
to assess the psychometric properties of an attitude scale con- 
structed with referents; to study the first- and second-order factor 
structures of attitude referents; and to test aspects of a structural 
theory of attitudes (Kerlinger, 1967a). The usual approach to the 
measurement of attitudes is to use statements or propositions that 
presumably reflect the attitudes. With the exception of work using 
the semantic differential (e.g., Osgood, Suci, and Tannenbaum, 
1957, pp. 104-116, 171-176, 192-195; Triandis and Davis, 1965), 
which ordinarily concentrates on the meaning of attitude concepts 
in semantic space and studies only a few concepts at a time, and 
one study done in New Zealand (Wilson and Patterson, 1968),
there seem to have been no attempts to use attitude referents them- 
selves as attitude items. 

It can be argued that measuring social attitudes with referent
items is superior to measuring them with the usual statement items.
One, referent items are less tied to time and place than statement
items: “private property” is probably just as important a con- 
servative referent now as it was forty years ago, while certain 
statements about private property can easily be dated. Two, state- 
ment items are more subject to ambiguity, e.g., an item can have 
two or more attitude referents in it—and which is the subject re- 
sponding to? Three, the use of referents permits wider coverage of 
the attitude domain simply because referent items are short and 
simple. Four, referent items are less affected by peculiarities of 
wording since there are far fewer words. Five, referents, being 
single words and short phrases, are easily adapted to different types 
of items—Likert, pair comparisons, forced-choice, and so on. The 
most important point, of course, is that using referents alone may 
better permit individual differences in attitude to be measured be- 
cause the perception and interpretation of items will presumably 
be less affected by extraneous and perhaps factitious sources of 
response variance.

Since the theory behind the study has been elaborated elsewhere 
(Kerlinger, 1967a), only an outline is presented here. Social atti- 
tudes are a subset of the domain of attitudes whose referents have 
shared general societal relevance to many people in religious,


turally speaking, is two dimensions or factors that are relatively 
orthogonal, 

A referent is a name, a category, a recurrence (Brown, 1958).
Any recurrence of a social nature can be the referent of a social
attitude. A referent, then, is a set of things toward which an atti-
tude may be directed. “Criterial” connotes a standard or judg-
ment; it means relevant, pertinent, significant. A criterial referent
of an attitude is a construct that is the focus of an attitude, that is
significant and relevant for the individual.

The universe of social attitude referents breaks down into two 
subsets that are expressed by the terms “liberalism” and “conservatism.”
For the conservative, for example, private property,
religion, subject matter, and certain other referents are criterial,
while such referents as social change, civil rights, and children’s
interest, not usually criterial to the conservative, are criterial to
the liberal. The empirical implications of the theory will be dis- 
cussed later. 

The limited evidence of the validity of the theory (Kerlinger, 
1967a, 1967b, 1970; Kerlinger and Kaya, 1959) has been produced 
by the usual form of item in sentence form. As indicated earlier, 
there has been little direct attack on the referents themselves. 
Osgood, et al. (1957, pp. 104-116) studied issues and persons refer- 
ents (federal spending, General Eisenhower) but analyzed their 
semantic differential data in a manner not pertinent to the above 
propositions. Triandis and Davis (1965) had their subjects rate, 
also on a semantic differential (evaluative factor), 35 issues and 
terms on civil rights and related matters. The issues they used 
were clearly attitude referents, some of which were liberal (civil 
rights, Negro teachers) and some conservative (segregated schools, 
separate but equal accommodations). In a factor analysis of the
issues and concepts (Davis and Triandis, 1965, pp. 160-162), the
liberal referents and the conservative referents loaded on separate 
factors, except for one factor, which was clearly bipolar. Hofman 
(1964) had his subjects rate on a semantic differential ten referents 
related to education. Factor analysis showed that the progressive 
(liberal) referents appeared on one factor and the traditional 
(conservative) referents on another factor. Wilson and Patterson 
(1968) constructed a referents scale, using a mixture of conserva- 
tive and radical (not liberal) referents. Their items, item scoring, 
and data analysis, however, precluded the possibility of producing 
evidence pertinent to the present theory. No other studies using 
referents alone and in other than semantic differential form have 
been found. 

The following propositions guided the study. Many first-order 
factors, each characterized by either liberal or conservative refer- 
ents but not both, underlie social attitude referents, and these fac- 
tors separate into the subset categories religious, economic, edu- 
cational, and so on. The intercorrelations among the first-order 
factors will yield two orthogonal second-order factors, one char- 
acterized by liberalism factors and the other by conservatism fac- 
tors. There will be relatively little bipolarity with large unselected 
samples. When it occurs, it is a function of a sample containing a
high proportion of extreme liberals or extreme conservatives to
whom the referents of liberalism or of conservatism are negatively
criterial (see Kerlinger, 1967a, p. 115). 

Answers to certain other questions were sought. What are the 
psychometric properties of an attitude scale constructed with refer- 
ents as items? How do the first- and second-order factors of such a 
scale compare to the factors of statement scales? 


Method 


Construction of Referents Attitude Scale 


Some 400 referents were collected from several sources. The 
most important sources for conservative referents were the sys- 
tematic discussions of conservatism by Rossiter (1962), Kirk 
(1960), Viereck (1962), and McClosky (1958). Systematic dis- 
cussions of liberalism, as such, seem to be scarce. Hartz’s (1955) 
and Orton’s (1945) books were used, however. For educational 
attitude referents there are rich resources: Brubacher (1962), Dewey 
(1902), Dupuis (1966), and Morris (1961). Referents were also 
found in existing attitude scales (see Robinson, Rusk, and Head,
1968; Robinson and Shaver, 1969; Shaw and Wright, 1967). In
addition, a number of referents were culled from the author’s own 
attitude scales and Q sorts (Kerlinger, 1956, 1967b; Shaw and 
Wright, 1967, pp. 322-324) and were written from knowledge and 
experience. 

The goal was an instrument of about 50 to 60 items. The cri- 
teria used in selection were: representativeness of the social atti- 
tude domain (religious, political, economic, educational, and social 
aspects), relative specificity (versus abstractness), clarity of mean- 
ing, and lack of redundancy. There were limitations on the applica- 
tion of these criteria, however. One, there were many more con- 
servative than liberal referents: it was much easier to find adequate 
conservative referents than it was to find liberal referents. Two, 
educational referents were included somewhat generously (20 
items) in order to provide a possible link with earlier educational 
attitude work. Three, many referents in the pool were so vague 
and ambiguous as to be useless (e.g., orders and classes, noble
culture).

After repeated application of the criteria, 50 referents, consisting
of single words and short phrases, were incorporated into a 7-point
summated-rating (Likert) scale called “Social Concepts,” or
Referents-I (REF-I). There were 25 liberalism and 25 conservatism
referents, with one of the latter (communism) a “negative” referent,
included for factor analytic purposes. The liberal and conservative
items were interspersed at random in the instrument.
The instructions asked respondents to express degrees of positive 
and negative feeling toward each of the concepts. 


Samples and Administration of Scales 


REF-I was administered to three samples of teachers and graduate
students of education in North Carolina (N = 206), Texas
(N = 227), and New York (N = 263) and to two smaller special
samples in North Carolina: 64 fifth-year graduate students of 
education, reputedly more liberal than the N = 206 sample, and 
97 miscellaneous business people. The Social Attitudes (SA) Scale, 
a 26-item measure of liberalism and conservatism (Kerlinger, 
1970; Shaw and Wright, 1967, pp. 322-324), and Education Scale
VII (ES-VII), a 30-item measure of progressive and traditional
educational attitudes (Kerlinger, 1967b), were also administered 
to the samples. Thus, we have six attitude measures, two each of 
liberalism and conservatism and one each of progressivism and 
traditionalism, administered to five samples in three states. 

To estimate repeat reliability, REF-I was administered twice 
at approximately one-month intervals to three additional samples: 
83 North Carolina graduate students of education, 48 North Caro- 
lina business people, and 60 New York City Police Academy cadets. 
Since a number of REF-I items might be vulnerable to response 
set tendencies, it and the following response set measures were also 
administered to 87 California liberal arts undergraduates and 32 
California business people and housewives: the Crowne-Marlowe 
(1964) Social Desirability Scale, the Couch-Keniston (1960) 15-item
Agreement Response Scale, a shortened (40-item) version of
the Bass (1956) Social Acquiescence Scale, and the Edwards 
(1958) Social Desirability Scale. 


Analysis 
Means, standard deviations, the correlations between the L and 
C total subscale scores, and reliability coefficients (alpha) of the 
REF-I data obtained from the five samples were calculated and
are reported in Table 1. Item-total correlations, L items with L
totals and C items with C totals, were also calculated (product-moment
r's).² The correlations among all the scale items were calculated
and the resulting 50 by 50 correlation matrix was factor
analyzed with the principal factors method (Harman, 1967), using 
squared multiple correlations in the diagonals of the correlation 
matrix. The factors were rotated both orthogonally and obliquely. 
In this paper only the oblique solutions will be considered because 
the two kinds of solution did not differ too much, and the oblique 
solutions were of course the ones used in the second-order factor 
analyses (see below). 

The promax method was used to rotate the factor matrix 
(Hendrickson and White, 1964, amended by Saunders). The cor- 
relations among the obliquely rotated factors (Thurstone, 1947, 
Ch. XVII) were themselves factor analyzed using the procedure 
described above. The resulting second-order factors were rotated 
with the varimax method (Kaiser, 1958). 

The correlations between the subscales of REF-I, the SA Scale, 
and ES-VII were calculated as a limited test of the validity of
REF-I. Correlations between the REF-I L and C subscales and 
the response set measures mentioned earlier were also calculated. 


Results 


Study of the statistics of Table 1 shows that the means of the 
different samples, except for the two special North Carolina sam- 
ples, the group of individuals outside the university, N = 97, and 
the group of fifth-year graduate students of education, N = 64, 
are rather similar. The differences between the means of Table 1 
are given in Table 2. These differences were tested for statistical 
significance with the t test using degrees of freedom adjusted for
the different sample sizes (Walker and Lev, 1953, p. 158). The 
differences and their magnitudes are consistent with knowledge of 
the samples’ presumed liberalism and conservatism. For example, 
the N = 206 group's L mean is half a scale unit greater than the 
N = 97 group’s L mean, but does not differ significantly from the


² In calculating the statistics, two items were omitted: the communism
item mentioned earlier and one L item. The communism item was
included only for factor analytic purposes; the L item was omitted to
obtain 24 items per subscale.


TABLE 1
Referents Scale I (REF-I): Means, Standard Deviations, Reliabilities, and
Correlations Between L and C Subscales, North Carolina, Texas, and
New York Samples

                     N.C.                    Texas     N.Y.
N            206      97         64          227       263
M:
   L        5.62     5.07       5.89        5.65      5.67
   C        5.79     5.84       4.81        5.47      5.57
s:
   L         .63  [illegible]    .56         .60       .63
   C         .60      .59        .86         .70       .78
r_tt:
   L         .83      .85        .84         .85       .83
   C         .84      .87        .90         .88       .89
r_LC:       −.15     −.01       −.17        −.07      −.21


means of the other two university groups, N = 227 and N = 263.
(The N = 206 C means do differ significantly from the other two 
university groups’ C means, but the magnitudes of the differences 
are small.) Evidently REF-I can successfully differentiate groups 
that presumably differ in liberalism and conservatism. 

The reliabilities of the L and C subscales of REF-I seem satisfactory.
The alpha coefficients (r_tt) reported in Table 1 range from
.83 to .90. The repeat coefficients of the three special samples mentioned
earlier are, for L and C, respectively: .86 and .88 (N.C.,
N = 83), .73 and .76 (N.C., N = 48), and .76 and .81 (N.Y.C.,
N = 60).
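
For readers who want to see how an alpha coefficient of the kind reported in Table 1 is obtained from item scores, here is a minimal sketch of the standard coefficient alpha formula. The data are invented and the function name `coefficient_alpha` is hypothetical; nothing here reproduces the study's own computations.

```python
import numpy as np

def coefficient_alpha(X):
    """Coefficient alpha for a respondents-by-items score matrix."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]                               # number of items
    sum_item_var = X.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = X.sum(axis=1).var(ddof=1)        # variance of total scores
    return k / (k - 1) * (1.0 - sum_item_var / total_var)

# Invented 7-point ratings: six respondents by four items.
ratings = np.array([[6, 5, 6, 7],
                    [4, 4, 5, 4],
                    [7, 6, 6, 7],
                    [3, 4, 3, 2],
                    [5, 5, 6, 5],
                    [2, 3, 2, 3]])
alpha = coefficient_alpha(ratings)
```

Alpha approaches 1 as the items covary more strongly relative to their individual variances, and it is 0 when the items are mutually uncorrelated.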

The final line of Table 1 reports the correlations between the L 
and C subscales of REF-I. They are, as expected, low and nega- 
tive, ranging from —.01 to —.21 and averaging about —.12. These 


TABLE 2
Differences between Liberalism (L) and Conservatism (C) Means and Their
Statistical Significance*

                        L                                      C
N         97      64      227     263            97      64      227     263
206      .55*   −.28*    −.03   [illegible]     −.05    .98*    .32*    .22*
97              −.82*    −.58*   −.60*                  1.03*    .36*    .27
64                        .24*    .22                           −.66*   −.76*
227                             [illegible]                             −.10


* Those differences marked with an asterisk are significant at least at the .01 level. If a difference
has a minus sign before it, the mean of the sample indicated at the top of the table is greater
than the mean of the sample indicated at the side of the table.
correlations add to the evidence cited earlier of the relative orthog-
onality of the L and C dimensions. 

The item-total correlations, L items with L totals and C items 
with C totals, were mostly high—.40 and higher—in all samples. 
Six of the items, three L and three C, did not meet the criterion of 
.35 or greater. They were: scientific knowledge, separation of 
church and state, activity programs in schools, homogeneous group- 
ing, education as intellectual training, capitalism. Seven other 
items were doubtful: they achieved the criterion in one or two but 
not in all samples. And this lack of correlation with the subscale 
total scores is not due to the lack of variability; none of these items 
had abnormally low standard deviations. In general, 35 of the 48 
items clearly satisfied the criterion in all samples; less than ten of 
them did not, and some six or seven others were doubtful. 

While these results are encouraging, there is one aspect of the 
data of Table 1 that is puzzling. The L and C means of the five
samples are too high. They run around 5.5 on a 7-point scale. To
be sure, the discrepancies between the L and C means of the N =
97 and N = 64 samples, the former presumably more conservative
and the latter more liberal, are “as they should be.” But the generally
high level of both the L and C means (supported by the generally
high level of the means of the individual items), as contrasted
to the L and C means of the SA Scale (see above), which are about
a scale unit lower, is cause for concern about something other than
social attitudes being measured by the single words and short 
phrases of REF-I. (See Discussion section.) 


First-Order Factor Analysis 


The data of the North Carolina samples, N = 206 and N = 97,
were combined into one sample, N = 303, and factor analyzed.
The data of the N = 64 sample were not included because the in- 
dividuals in the sample were "known" to be deviant in their atti- 
tudes (strongly progressive in their educational attitudes and prob- 
ably strongly liberal in their social attitudes, from report of the 
administrator of the scales). In fact, this sample was obtained to 
test the bipolar aspect of the theory, and the data might to some 
extent have contaminated the factor analysis of the more “regular”
data of the other samples. The Texas, N = 227, and New
York, N = 263, samples were also factor analyzed separately. Because
the North Carolina and Texas factor analytic and other
results were alike, and because a large sample was wanted to
achieve greater stability of factors, especially in the second-order
analysis, the two samples were combined and the data of the combined
sample, N = 530, became the basic data of the study. The
results from the New York, N = 263, sample, too, were quite similar
to those of this combined sample, but it was decided to use
the results of the New York analysis as a replication and check
on the results of the combined sample rather than to combine it
with the N = 530 sample.

The 50 by 50 correlation matrix, the unrotated factor matrix, 
and the obliquely rotated (promax) six-factor matrix are given in 
Tables A, B, and C.³ A six-factor solution was chosen because no
more than six factors seemed justified by the eigenvalues and the 
magnitudes of the loadings of the unrotated factor matrix. In addi- 
tion, comparison of the orthogonal solutions of various numbers of 
rotated factors showed that the six-factor solution not only agreed 
well with the original R matrix; it supplied a highly satisfactory
simple structure and an even more satisfactory second-order simple
structure.

The rotated factor matrix and its loadings give unambiguous 
answers to some of the questions asked earlier. The factor arrays 
of the six factors are given in Table 3. Note, first, that the liberal 
and the conservative referents are loaded positively on different 
factors. Accepting loadings of .30 or greater as significant, there
are only two exceptions: an L referent (scientific knowledge)
loaded positively on a C factor, and a C referent
(racial purity) loaded negatively on an L factor. There are three
L factors and three C factors. Second, there is little evidence of
bipolarity. There are negative loadings in the matrix, but they are
for the most part small. Third, the content of the individual factors
is clear-cut, as will be seen. The results of the New York, N = 263,
sample were substantially the same, except that two of the C fac- 
tors had three or four more significant loadings. In short, the evi- 

3T: ited with


TABLE 3
Factor Arrays of Oblique First-Order Factors, Combined North Carolina and Texas
Samples, N = 530ᵃ

I. Religiosity
    religion (.78)
    church (.73)
    faith in God (.72)
    Christian (.69)
    religious education (.57)
    teaching of spiritual values (.53)
    moral standards in education (.36)
    patriotism (.33)

V. Educational Traditionalism
    subject matter (.59)
    education as intellectual training (.52)
    school discipline (.44)
    homogeneous grouping (.30)

VI. Economic Conservatism
    free enterprise (.62)
    real estate (.53)
    private property (.43)
    capitalism (.37)
    national sovereignty (.30)
    (scientific knowledge (.30))

II. Civil Rights
    Negroes (.60)
    civil rights (.57)
    racial integration (.57)
    Jews (.46)
    desegregation (.43)
    (racial purity (−.37))

III. Child-Centered Education
    children’s interests (.56)
    child-centered curriculum (.54)
    pupil personality (.54)
    children’s needs (.52)
    self-expression of children (.47)
    pupil interaction (.44)
    child freedom (.37)

IV. Social Liberalism
    Social Security (.53)
    Supreme Court (.50)
    federal aid to education (.49)
    poverty program (.48)
    socialized medicine (.47)
    United Nations (.43)

ᵃ The loadings are given in parentheses. Loadings ≥ .30 were considered significant. The two
parenthesized referents are one L item loaded on a C factor and one C item loaded on an L factor.


three C factor categories 
› With two non-economic 


IV, however, includes economic and “social” items. The L factors
are named: II: Civil Rights; III: Child-Centered Education; IV:
Social Liberalism.


Second-Order Factor Analysis


The correlations among the six primary factor vectors are given 
in Table 4. The correlations among the L factors are positive; 
those among the C factors are also positive. The cross-correlations,
those between L and C factors, are low positive and low negative.

The factor analysis of the correlation matrix yielded two clear 
factors: the first three eigenvalues were 1.76, 1.14, and .19. The 
unrotated and rotated factors, with the L and C identifications of 
the factors, are reported in Table 5. The simple structure could 
hardly be better. Note that the C first-order factors are loaded on 
the first rotated second-order factor and hardly at all on the second 
factor, and the L first-order factors are loaded on the second factor 
and not on the first factor, except for II, which has a low loading of 
—.22. The predicted second-order factors are right on the theo- 
retical target. A plot of these loadings shows an almost perfect 
orthogonal structure, with the L factors closely clustered near one 
axis and the C factors near the other.⁴


Correlations with Other Scales 

The correlations of REF-I, L and C, with the Social Attitude 
(SA) Scale, L and C, ranged in the five samples from .43 to .58 for
L and from .54 to .66 for C. The average r's, via z, were .51 and


TABLE 4
Correlations among Primary Factors, Combined North Carolina and Texas Samples,
N = 530

         I       II      III     IV      V       VI
I       1.00    −.24     .11    −.15     .57     .39          C
II              1.00     .39    [illegible]                   L
III                     1.00     .37     .15     .09          L
IV                              1.00    [illegible]           L
V                                       1.00    [illegible]   C
VI                                              1.00          C

⁴ Similar second-order factor analytic results have been obtained with the
author's social attitude scale and educational attitude scales. See the second-order
factor matrices and the plots of the second-order factors in the theoretical
article cited earlier (Kerlinger, 1967a, pp. 118, 119). The New York,
N = 263, sample yielded highly similar results. The eight- and ten-factor
[illegible] solutions of the N = 530 and N = 263 data had somewhat more
bipolarity: one or two loadings between −.30 and −.40. There are also data


TABLE 5
Unrotated and Rotated Second-Order Factor Matrices, Combined North Carolina and
Texas Samples, N = 530ᵃ

         Unrotated Matrix       Rotated Matrix      Factor Type
I          .69      .19          .71     −.09            C
II        −.44      .51         −.22      .64            L
III       −.05      .64          .19      .61            L
IV        −.36      .55         −.13      .65            L
V          .70      .33          .78      .04            C
VI         .68  [illegible]      .68     −.12            C

ᵃ Significant loadings (≥ .35) are italicized.


.59. The L-C cross-correlations’ range was −.33 to −.53, with an
average of −.40. The C-L range was −.15 to −.47, with an average
of −.30. The average correlations of REF-I, L and C, with Education
Scale VII (ES-VII), progressivism and traditionalism, were
.52 and .50. The average cross-correlations, L with traditionalism
and C with progressivism, were −.25 and −.11. The congruent
(L with L, C with C, etc.) correlations were thus satisfactory
(>.50). The cross-correlations, however, were not in line with the
theoretical expectation of near-zero cross-correlations. Perhaps the


are encouraging. Of 18 r's, only two are statistically significant
(p < .01): the Crowne-Marlowe Scale with С, .37, and the Bass 
Scale with C, .25, both arising from the California, N = 87, sample. 
The comparable Crowne-Marlowe and C r in the North Carolina, 
N = 83, sample was —.07, not significant, and the comparable Bass 
and C r in the California, N = 32, sample was .15, also not signifi- 
cant. Most of the other r’s hovered around zero. Evidently REF-I is 
not too seriously affected by response sets of the kinds measured by
these scales. 
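
The average correlations in this section are described as having been obtained "via z," that is, through Fisher's r-to-z transformation. A minimal sketch of that averaging is given below. It assumes an unweighted mean of the transformed values, which the text does not specify, and the five correlations are placeholders rather than the study's data.

```python
import numpy as np

def mean_r_via_z(rs):
    """Average correlations through Fisher's z (unweighted; an assumption)."""
    z = np.arctanh(np.asarray(rs, dtype=float))  # r -> z
    return float(np.tanh(z.mean()))              # mean z -> r

# Placeholder correlations from five hypothetical samples.
sample_rs = [0.43, 0.48, 0.52, 0.55, 0.58]
average_r = mean_r_via_z(sample_rs)
```

If the samples differ greatly in size, weighting each z by its sample size minus three is a common refinement.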
Discussion

The propositions of the structural theory of attitudes discussed
earlier seem to be supported by the results of this study. First,
there are many first-order social attitude factors that are char- 


where the picture is not so neat. With some hen 
sumone samples or more factors and whe: 


ment and referent items is analyzed, more bipolarity enters the
picture. The L and C items, however, have mostly appeared on different
factors.


acterized by the rubrics religious, political, educational, economic,
and general social. Second, these first-order factors are either lib- 
eral or conservative, but not both. That is, liberalism referents do not 
appear on conservatism factors, and conservatism referents do not ap- 
pear on liberalism factors. The third point is implied by the second: 
relatively little bipolarity appears in the correlations and the factor 
analysis of the items. Finally, and perhaps most important, two 
relatively orthogonal second-order factors underlie the first-order 
attitude referent factors, one associated with liberalism first-order 
factors and the other with conservatism first-order factors. 

Basically the same second-order factor structure seems to under- 
lie both attitude statements and attitude referents. It is not possible 
yet to say that the first-order factors are the same or similar be- 
cause the research was not specifically designed to study first-order 
statement and referent factor comparability. The factor arrays of 
studies of educational attitude statements (Kerlinger, 1967b) and 
those of two unpublished studies of social attitude statements show 
similarities that indicate family resemblances between statement 
factors and referent factors. For example, Religionism, Economic 
Conservatism, Social Liberalism, and certain educational attitude 
factors appear in both statement and referent factor arrays. 

The most difficult part of the theory to test adequately is the no- 
tion that bipolarity is not a basic feature of social attitudes. Al- 
though little bipolarity appeared in this study and in earlier stud- 
ies, it is fairly clear that bipolarity will appear under conditions 
of sampling (see earlier discussion) and with certain kinds of items. 
Bipolarity will appear, essentially, when conservative referents are 
criterial to liberals and liberal referents are criterial to conserva- 
tives—in both cases negatively. With a large proportion of John 
Birch Society members in a sample, one would expect bipolarity 
to appear because, although conservative, such individuals are evi- 
dently more anti-liberal than they are conservative (Bell, 1964). 
Presumably the same is true of extreme liberals or radicals. SDS 
members, for example, do not seem so much to support liberal 
issues (indeed, they are anathematized) as they oppose any
establishment values (see Berger, 1969).5


5 One wonders, however, whether such a formulation is valid with student
radicals. It seems that their value and attitude predilections may be
orthogonal to those of liberals and conservatives. Perhaps they largely
reject both L and C referents and respond positively only to other kinds
of referents.

There is evidence, on the other hand, that bipolarity is an
integral part of social attitudes (e.g., Comrey, 1966; Comrey and
Newmeyer, 1965).6 Comrey (1966) even says that there is a strong
(bipolar) social attitude factor. The substantial negative cross- 
correlations between the L and C subscales of REF-I and the SA 
Scale reported earlier are also at variance with the theory. At 
present, then, the bipolarity issue is far from settled. The evidence 
of this study and earlier studies, however, casts serious doubt on 
the accepted assumption of bipolarity, or a single liberalism- 
conservatism dimension. 

Referents can evidently be used as attitude items. The evidence
indicates that REF-I’s L and C subscales are reliable and
factorially valid. In addition, the subscales successfully differentiate
groups with different attitudes, though this evidence is not as strong
as the factor analytic and reliability evidence. REF-I has a
troublesome defect, however. Its L and C means and the means of the
individual items are too high (around 5.5 on a 7-point scale), perhaps
indicating that REF-I is also measuring something other than
social attitudes. Immediate candidates are social desirability or
social acquiescence, except that the evidence presented earlier on the
correlations between the REF-I L and C subscales and the response
set measures was negative.

A simpler interpretation may be in order. Since the REF-I
items, with one exception (communism), are all positive, they may
reflect general approval of the social, economic, religious, and
educational institutions and practices that they presumably reflect.
That is, most people generally approve, if perhaps not with equal
enthusiasm, such varied notions as faith in God, children’s needs,
private property, family, scientific knowledge. And there are
enough of such generally approved referents in REF-I to elevate
the means. Study of the high means (> 6.2) showed substantial
congruence across samples, and these means reflect the general
acceptance quality mentioned above. (The example referents just
given were high in at least three different samples.) One is
reminded of the studies and findings of Prothro and Grigg (1960)
and McClosky (1964). In both studies, the responses of most
subjects to abstract attitude statements were strongly positive, but
agreement dropped sharply when the statements became specific.
Since many or most of the referents of REF-I are quite abstract, by
their nature as concepts divorced from specific situations, the high
means may simply reflect the same strong endorsement of
abstractions that Prothro and Grigg and McClosky found. Such a post
hoc analysis, of course, can only be suggestive. Systematic research
will have to be done to explore alternative interpretations. For
measurement purposes, this difficulty can to some extent be avoided
by using the referents in forced-choice formats: paired comparisons,
tetrads, pentads, and rank order.

6 Other evidence and comment on such evidence are given in the theoretical
article cited earlier (Kerlinger, 1967a). It is extremely difficult to disentangle
published results. In most cases of reported bipolar social attitudes, inadequate
assumptions or analyses are part of the picture. For example, bipolarity is
simply assumed and item and scale scoring adapted to the assumption. Or
unrotated factors, which are artifactually bipolar, are interpreted (see Shaw
and Wright, 1967, Ch. 17). I have reanalyzed the data of several studies and
in most of the cases the presumed bipolarity has virtually disappeared. The
Comrey and Newmeyer (1965) data were not so tractable: bipolarity was
strongly evident. But it is not clear whether the bipolarity was “legitimate” or
partially artifactual (reverse scoring, for example).

In sum, the data of the study add strong evidence to earlier evi- 
dence that the theoretically predicted dualistic social attitude 
structure is “real,” that bipolarity is not the important element of 
attitudes it has been conceived to be, and, perhaps most important, 
that referents are the substantive basis of social attitudes. It must 
quickly be added, however, that the study produced minimal di- 
rect evidence of the importance of criteriality of referents. The 
theory actually depends upon the notion that different referents 
are differentially criterial for different individuals and sets of in- 
dividuals, and the only direct evidence came from the comparisons 
of the means of the different samples (see Tables 1 and 2 and 
accompanying discussion). Nevertheless, it is hard to conceive 
how the first- and second-order factors would have emerged as 
they did if the referents were not differentially criterial to differ- 
ent individuals. 

In one sense, attitude referents are the most important parts of 
attitudes. They are the core of the cognitive component of attitudes. 
To be sure, the evaluative (in the semantic differential evaluative 
factor sense), emotional, and motivational components of attitudes 
may be more important in actually influencing behavior. But the 
behavior is in large part triggered by and directed toward
referents. In other words, the emotional, motivational, and evaluative
aspects of attitudes cannot work without the cognitive substance
that is the referents. Referents, therefore, should be extensively 
and intensively studied. They are not only the core of the cogni- 
tive component of attitudes. They may also provide a bridge be- 
tween attitudes and values. As referents become more abstract, for 
instance, do attitudes approach values? Do children learn their 
values and attitudes by learning the criteriality of sets of refer- 
ents? The answers to these and other related questions should help 
to build a substantial body of scientific knowledge that will help 
to explain attitudes and behavior influenced by attitudes.

Within the past few years, the use of personality tests for
personnel selection in industrial and governmental agencies has come
under criticism. Congressional hearings into the use of personality
tests were held (American Psychologist, 1965); and, at one point, a
bill prohibiting the use of psychological testing in the selection of
federal employees was introduced in the Senate Chambers.
Opponents of psychological testing have voiced several types of
objections: the fear of extensive government files being accumulated on
every citizen; the possibility of creativity being squelched (Whyte,
1956); and, the most common complaint, that testing constitutes
an invasion of privacy. Critics such as Martin Gross (1962, 1967)
argue that the tests are invalid for the purposes for which they are
used, and, more basically, that there is no justification for the
common practice of asking questions of a personal, private nature.
These views are shared by many, and are often expressed by those
who are required to take the tests under employment selection
conditions. Some psychologists (Lovell, 1967) maintain that
certain types of tests, such as personality inventories, should not be
used for personnel selection.

Personality inventories which have demonstrated validity and 
which are easy to administer are likely to enjoy an even broader 
use in the future; and, as personality inventories like the MMPI 
come to be used more as screening devices on a “normal
population” for purposes of placement and job selection, the problem of
test acceptance could become more prevalent.

Two possible ways of alleviating or reducing the objection to in- 
ventories have been suggested: one involves simply removing or 
allowing the respondent to omit those items which he views as 
overly personal or offensive. The results of studies in which this 
technique was used have not been encouraging. Previous research 
using the MMPI (Butcher and Tellegen, 1966) has shown that the 
number of items commonly objected to cannot be eliminated from 
the test without affecting several of the existing clinical scales. 
Results of studies on the use of instructions which permit the re- 
spondent to omit offensive items (Walker, 1967; Walker and Ward, 
1969) suggest that although reliability was lessened, no mean 
profile differences were obtained. Moreover, it has not been
demonstrated that the opportunity to omit items will cause a respondent
to view the test in a more favorable light. The individual is still 
being asked the questions and he has no way of knowing how his 
omission of some items will be interpreted. Therefore, it is not 
clear that allowing subjects to omit items makes the test-taking 
situation less objectionable. 

A second approach that might be taken to reduce objections to 
personality inventories would be informing the public, partic- 
ularly those being tested, of the nature of the testing, e.g., what is 
the test, how was it derived, how will it be interpreted and used, 
and what can the respondent hope to gain (Meehl, 1969). While 
such a procedure might be impossible with some diagnostic pro- 
cedures such as projective techniques, there is no obvious reason 
why it could not be employed with empirically derived and scored
inventories such as the MMPI. Hathaway (1964) saw the potential 
reassuring effect of such an approach several years ago. In his re- 
sponse to a layman who had objected to his having been forced to 
take the MMPI as part of a job application, Hathaway described 
the derivation of the scales, the mechanical method of scoring the 
responses, and what psychologists hope to learn from the tests. An
educating approach of this type may offer the best way of increas- 
ing the acceptance of personality inventories. 

The present study is an attempt to evaluate what effect detailed,
well-defined instructions have upon responses to a personality
inventory and attitudes toward the test situation.

Subjects were 100 undergraduates enrolled in the introductory
psychology course at the University of Minnesota. The Ss were 
volunteers who participated in order to receive experimental points 
which would be added to their grades. 

The MMPI and the Personality Research Form (PRF) were
administered with either normal instructions or instructions
modified to reduce objections to the test. In the development of the
PRF, care was taken to ensure its appropriateness with normal
subjects (items involving extreme deviancy were kept at a
minimum), and equal consideration was given to the problem of
desirability as a response style (Jackson, 1965). Because of these
factors, the PRF was included as a source of potential comparison
with the more clinically oriented and possibly more objectionable
MMPI. The instructions which appear below specify in a
straightforward manner the empirical validation of the scales, the existence
of the validity scales, and the manner in which the protocols are
commonly interpreted.

The following personality inventory is made up of many state- 
ments; you are to decide whether the statements are mostly true 
as applied to you or mostly false, and then fill in the appropriate 
spot on your answer sheet. In taking the test, some people have 
been concerned about certain things. For example, some people 
wonder how honest they have to be in responding to the items. In 
the development of the inventory, several scales were constructed 
to allow the interpreter to evaluate test-taking attitudes. In other 
words, people who give an overly virtuous picture of themselves or 
people who try to appear more psychologically disturbed than they 
are easily detected. Thus, their test protocols are invalid and have 
to be discarded. 

It is also important to realize that this inventory was developed
as a way of measuring individual personality traits, not just to
detect if a person is insane or not. We all know that every person is
different, that is, has a different personality, and that they are
better suited for certain things because of this. This test helps a
psychologist understand what an individual’s personality is like,
and, by this, enables him to advise and help in a more efficient
manner.

Some of the statements may seem unrelated to anything about
your personality, and other items may seem too personal. A word,
then, about how these items were chosen. A large list of
statements was given to a group of normal people and to people
suffering from many kinds of personality problems. Then, the
statements that were answered with different frequency by the two
groups were selected as a scale, and it was shown that people
having certain kinds of personality structures will answer these items
in similar ways. There may be no logical reason why, for example,
an unhappy person will answer items about sex, or religion, or
body functions in a certain way, but he does. So, the important
thing to remember is that test interpretation does not involve
reading your specific responses. Scoring involves simply placing a
scoring stencil over the answer sheet and counting the responses
for each personality scale. This allows us to compare your total
responses on each scale with other people.

We hope that you will answer all the items, unless they really do
not apply to you.

1 It should be pointed out that the study was concerned primarily with the
MMPI; hence, these instructions were not totally apropos to the PRF.

The order of the tests within groups was counterbalanced, and,
since the items from both tests had been retyped, subjects were at
no time aware that they were responding to established
psychological inventories. It was hoped that in this manner preconceptions
regarding particular tests would not influence the results.

Following completion of the inventories, a short questionnaire
designed to assess the effectiveness of the instructions was
administered:

Imagine that you had been asked to take this test as part of the
selection procedure in a job application.

1. Would you have felt that some of the items were (highly
offensive, mildly offensive, not offensive)? (Please circle one.)
2. In your opinion, would this test constitute an invasion of
privacy? (Yes, No, Unsure.)
3. Would you feel (highly anxious, mildly anxious, not anxious)
about the results of the test?


Results and Discussion 


The major concern of the study involved what changes the
instructions may have caused in the MMPI and PRF results. To
assess this, analyses of the within-sex mean scale scores for both
the MMPI and the PRF were done, comparing the special
instruction and control groups (see figures), and the means and
variances of the scores from both groups on all scales were
compared. The results of these analyses generally supported the
position that special instructions do not affect the mean profiles of the
respective tests. Of the 72 comparisons possible, only five such
comparisons of means by t-test reached significance at below the
.05 level. On the PRF, the means of three scales were significantly
different: the Nurturance scale for both males and females and the
Succorance scale for females. Since only three of 44 possible PRF
scale comparisons were significant, an explanation simply in terms
of chance seems most parsimonious. Absolute differences, moreover,
are quite small, as can be seen in Figure 1 where the mean PRF
profiles of experimental and control groups are presented.

Figure 1. Mean PRF Profiles for Male and Female Subjects in the
Experimental (Special Instruction) and Control Conditions.

A similar lack of significant differences is evident from an inspection
of the mean MMPI profiles. The only scale significantly affected
by the instructions was the ? score. Both males and females of the
special instruction group omitted fewer items than controls. Since
the mean number of omissions in either group is very low, the
differences are negligible. This may suggest a more accepting attitude
toward the test on the part of Ss in the special instruction groups.
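
The appeal to chance for the three significant PRF comparisons can be made concrete with a rough calculation. The Python sketch below assumes, purely for illustration, that the comparisons are independent tests each run at the .05 level; scales from one inventory given to the same subjects are in fact correlated, so the figures are only a guide. The function names are hypothetical and are not part of the original study.

```python
from math import comb


def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of nominally significant results if every null is true."""
    return n_tests * alpha


def prob_at_least(k: int, n_tests: int, alpha: float) -> float:
    """P(X >= k) for X ~ Binomial(n_tests, alpha); assumes independent tests."""
    p_fewer = sum(
        comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    # 44 PRF mean comparisons, three of which were significant
    print(expected_false_positives(44, 0.05), prob_at_least(3, 44, 0.05))
    # all 72 mean comparisons, five of which were significant
    print(expected_false_positives(72, 0.05), prob_at_least(5, 72, 0.05))
```

Under these assumptions, about two of the 44 PRF comparisons would be expected to reach the .05 level by chance alone, and three or more would occur in well over a third of such studies, which is in line with the chance explanation offered in the text.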

It may be noted that the difference between male groups on the 
K scale approached significance (t = 1.99, p < .10). This might be
expected since the instructions could easily have tended to lower 
defensiveness in the test-taking situation. This difference was ab- 
sent in the females, however; in fact, a slight trend in the opposite 
direction was found. Interpreting this difference as a chance finding 
is probably justified. 

Mean profiles on the MMPI are shown in Figure 2. Visual in- 
spection shows little difference of practical significance. 

The findings from the comparison of group variances are
somewhat more ambiguous. On the PRF only two of 44 comparisons
reached significance at below the .05 level. On the Nurturance
scale for females, the special instruction group has a greater
variance than the control; a significant difference in the opposite
direction was found on the Play scale for females.

Figure 2. Mean MMPI Profiles for Male and Female Subjects in the
Experimental (Special Instruction) and Control Conditions.

On the MMPI, eight significant variance differences were found.
The special instruction group has smaller variances than the control
on the ? scale for both sexes, and on the L scale for females. The
instructions had the effect of increasing variance on the following:
F scale (male), Pa scale (female), Pt scale (female), Sc scale (male),
Si scale (male).

The greater number of significant differences in the MMPI over 
the PRF is probably a function of the former test’s greater
heterogeneity of item content in each scale as well as its greater range
of possible scores, MMPI scales numbering between 40 and 70 items
while the PRF has a uniform 20 items per scale. In examining the 
variance differences on the MMPI, the only consistent finding 
seems to center around several scales which are often influenced by 
the willingness of respondents to admit to undesirable thoughts 
and symptoms. Although the means remained unchanged, the in- 
structions apparently had the effect of spreading out the scores on 
such clinical scales as Pa, Pt, Sc and Si as well as the F scale. 
While caution in interpreting either extremely high or extremely 
low scores would be indicated, test validity is not necessarily 
weakened, and, if anything, is probably enhanced, since the instruc- 
tions appear to accentuate individual variability. 

In summary, then, it would appear that no significant mean profile 
differences result from the use of detailed educating instructions; 
and the variance changes which occur do not weaken test validity. 
The next question of interest concerns whether the instructions 
have altered the subject’s feelings about the tests. Are such detailed 
instructions a reasonable way to reduce objections to personality 
inventories? Summary data from the three questionnaire items are 
given in Table 1. It can be seen that subjects in the special instruction 
group viewed the test as less offensive (Question 1), and fewer Ss 
felt that the tests constituted an invasion of privacy (Question 2). 
The instructions, however, had no effect on the amount of concern 
experienced by subjects regarding results (Question 3). These data, 
together with the finding of the low ? score on the MMPI in the
special instructions group, would seem to indicate that instructions
explaining the psychological test and giving reassurance about
its use do decrease resentment toward the test-taking situation.

One problem inherent in the present study involves the test-taking
set of the Ss. They knew during the testing that their responses
would be utilized for research purposes, and their set for the
questionnaire was also hypothetical. In an actual job application situation
the responses might be considerably different.

In summary, if personality inventories similar to the MMPI 
and the PRF are to be used as screening devices in selection situa- 
tions, the public resentment toward them must be reduced. Al- 
though it has been suggested in some studies that allowing respon- 
dents to omit items to which they object is feasible, there is no 
evidence to indicate that such an approach is effective in reducing 
voiced objections. Yet, many people react to a personality inven- 
tory with curiosity or fear; and, if the psychologist recognizes 
these feelings and concerns and makes an effort to alleviate them 
and clarify some common misconceptions about the test, then mis- 
trust and objections may diminish. It has been demonstrated in the 
present study that these negative attitudes can be modified with- 
out altering test profiles. 

Opportunities for the use of explanatory instructions with the 
MMPI and PRF are many. With slight alterations, the instruc- 
tions used in the present study could possibly be considered for use 
in a variety of situations which in recent years have produced ob- 
jections to personality inventories. In situations where these or 
similar standardized instruments are used for screening purposes 
with a normal population, such instructions might be beneficial. If, 
in fact, these instructions reduce the resentment of job applicants 
to taking personality inventories, they obviously should be used. 
Since this study does not deal with a patient population, it is not 
known what effect altered instructions may have in a clinic setting
when testing disturbed people. Occasions frequently arise in the
clinic, however, when it is necessary to assess “normal” family
members. The administration of such tests is frequently resented;
a common example is found in child guidance settings where
parental assessment is routine yet often not understood by the parent
who is suddenly faced with tests asking quite personal and, to him,
threatening questions. The use of reassuring, educating
instructions in this situation may be valuable in reducing objections or
resistance.

TABLE 1
The Effects of Special Instructions on Ss’ Perception of Item Offensiveness,
Invasion of Privacy, and Anxiety over Results

          Question 1:              Question 2:              Question 3:
          Item Offensiveness       Invasion of Privacy      Anxiety over Results
          Highly  Mildly  Not      Yes    No    Unsure      Highly  Mildly  Not
Exp.         3      17     30        3    40       7          12      21     17
Control     12      15     23       15    26       9          12      24     14
          χ² = 6.450, p < .05      χ² = 11.220, p < .005    χ² = .490, N.S.
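
The chi-square values reported in Table 1 can be checked directly from the cell counts. The sketch below, in plain Python, computes the ordinary Pearson chi-square for a contingency table; the function and variable names are illustrative and do not come from the original study, which does not describe its computing procedure.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count under independence of group and response
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Rows: experimental (special instruction), control. Counts from Table 1.
TABLE_1 = {
    "item offensiveness (highly, mildly, not)": [[3, 17, 30], [12, 15, 23]],
    "invasion of privacy (yes, no, unsure)": [[3, 40, 7], [15, 26, 9]],
    "anxiety over results (highly, mildly, not)": [[12, 21, 17], [12, 24, 14]],
}

if __name__ == "__main__":
    for label, counts in TABLE_1.items():
        stat, df = chi_square(counts)
        print(f"{label}: chi-square = {stat:.3f}, df = {df}")
```

With the counts as printed, this reproduces the three reported statistics (6.450, 11.220, and .490), each on 2 degrees of freedom.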

Finally, instructions altered in this manner could make
personality inventories like the MMPI considerably more acceptable
to the subjects in a research study.

After many years of neglect of serious inquiry on personality
organization in ethnic minority groups, the current psychological
literature reveals a sharp contemporary interest in studying
Negro-white personality differences (Pettigrew, 1964; Deutsch, Katz, and
Jensen, 1968; Dreger and Miller, 1968). As such inquiry develops,
there is mounting evidence that generalizations derived from studies
of white samples do not hold up with Negro subjects (e.g.,
Carlson and Levy, 1970; Gurin, Gurin, Lao and Beattie, 1969;
Hedegard and Brown, 1969; Lott and Lott, 1963). Motivational,
cognitive, and affective variables appear to be patterned differently
in studies of performance, of sex-identity, and of
self-conceptualization in different ethnic groups. Moreover, the very complexity
of the emerging patterns suggests that dimensional approaches in
personality assessment are not fully adequate tools for capturing
ethnic differences in personality, and that more complex,
typological approaches may be more useful.

Among the available typologies, that of Jung (1923) appears
especially promising as a conceptual framework capable of
representing the organization of cognitive, affective, and temperamental
qualities within the individual. Although Jungian theory has not
been influential in American academic psychology (beyond a
simple and somewhat misleading adaptation of “extraversion” and
“introversion”), within recent years a standardized instrument,
the Myers-Briggs Type Indicator (Myers, 1962), has been
specifically developed to assess the Jungian typology. Critical
evaluations [...] number of theoretically oriented studies using the instrument,
are found in recent literature (Myers, 1962; Ross, 1966; Shapiro
and Alexander, 1969; Stricker and Ross, 1962, 1964a, 1964b).
However, virtually all of the standardization, construct
validation, and theoretical inquiry with this instrument has been based
upon various white middle class subject samples. Thus, the
potentialities of Jungian theory or of the Myers-Briggs Type Indicator
for describing ethnic personality patterns have remained
unexplored to this time.

1 Requests for reprints should be sent to Nissim Levy, Department of
Psychology, Howard University, Washington, D. C. 20001.

The present paper examines type distributions of Negro college
students, compares these with findings of earlier studies of white
college students, and provides evidence on the stability of
personality types as measured by the Myers-Briggs Type Indicator.


Description of the Typology 


Jung’s psychological types have been so thoroughly presented in
several sources (Dry, 1961; Fordham, 1953; Jung, 1923, 1983;
Munroe, 1955; Myers, 1962) that only a bare outline of the
typology is sketched here. Since we are here concerned only with
consciously developed attitudes and functions, aspects of Jungian
theory dealing with unconscious development of their counterparts
are omitted in the present discussion, along with such other
important components of the typology as distinctions between
dominant and auxiliary functions. Fuller discussion of the theory may
be found in the references cited.

The Jungian typology involves a set of interlocking dimensions:
the “attitudes” of Extraversion and Introversion, the four
“functions” of Sensation, Intuition, Thinking, and Feeling, which
are manifested at two levels of conscious development. Extraversion
and Introversion represent general orientations; the extravert
attends to the object, to the external world, while the
introvert focuses upon internal representations of experience.
Extraversion and Introversion operate in four functional modes.
Two of these, Thinking and Feeling, involve organization and
judgment; Thinking organizes experience in terms of logical, intellectual
features, while Feeling organizes experience in terms of liking or
disliking. The other two functions lack this organizing-judging
effect, and aim at perception: Sensing notes the presence and
qualities of things; Intuition leaps beyond sensory notation and
recognizes latent possibilities in things and events.

Through innate predisposition and environmental opportunity, 
one of each pair is more “natural” or developed in the individual. 
Thus, the person characteristically directs his cognitive functioning 
either toward the outer world (extraversion) or toward subjective 
experience (introversion), and comes to emphasize one of the
judging functions (thinking or feeling) and one of the perceptual 
functions (sensation or intuition) as his preferred, most character- 
istic mode of dealing with experience. 


Description of the Myers-Briggs Type Indicator


Myers (1962) has translated the Jungian type theory into a
self-report inventory designed to assess a person’s predominantly
consciously developed attitudes and functions. The Type Indicator
provides measures of Extraversion and Introversion (E-I),
Sensation and Intuition (S-N), Thinking and Feeling (T-F) and an
additional variable, Judging or Perceiving (J-P), which reflects the
dominant function within the individual. Myers follows Jung in
pointing to the categorical and interacting nature of the
dimensions. “The main purpose of the Indicator is to ascertain a person’s
basic preferences. EI, SN, TF, and JP are therefore indices
designed to point one way or the other, rather than scales designed to
measure traits. What each is intended to reflect is a habitual
choice between opposites, analogous to right- or left-handedness.
Thus EI means E or I, rather than E to I” (p. 2).

Two kinds of measures are obtained from this instrument:
continuous scores, representing the extent of development of various
dimensions, and the more basic type categories. Sixteen basic types,
representing qualitatively different patterns of organization of the
basic Jungian variables, comprise the resulting typology. (Fuller
discussion of theory, rationale, and psychometric properties of the
instrument may be found in Myers, 1962, and Stricker and Ross,
1962).
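
The count of sixteen types follows from choosing one pole on each of the four indices (2 x 2 x 2 x 2). As a small illustration, the Python sketch below enumerates the type codes and counts the number of indices on which two codes differ; the helper names are hypothetical and the sketch says nothing about how the Indicator itself is scored.

```python
from itertools import product

# The four dichotomous indices, each resolved to one letter or the other.
INDICES = (("E", "I"), ("S", "N"), ("T", "F"), ("J", "P"))


def all_types():
    """Every combination of one letter per index, e.g. 'ESTJ'."""
    return ["".join(letters) for letters in product(*INDICES)]


def scales_changed(first: str, second: str) -> int:
    """Number of indices on which two four-letter type codes differ."""
    if len(first) != len(INDICES) or len(second) != len(INDICES):
        raise ValueError("type codes must have one letter per index")
    return sum(a != b for a, b in zip(first, second))


if __name__ == "__main__":
    types = all_types()
    print(len(types), types)
    print(scales_changed("ESTJ", "ESFJ"))
```

A count of this kind (0 to 4 indices changed between two administrations) is one natural way to describe agreement between original and retest type categories.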


Method 


Subjects included 758 Negro undergraduates (311 males, 447
females) enrolled in several undergraduate courses at Howard
University. The Myers-Briggs Type Indicator was administered
in regular class sessions early in the spring semester, and additional
data were obtained on age, year in college, academic major, birth
order, number of siblings, and parents’ occupations. Approximately
two months later the test was readministered to 433 subjects (146
males, 287 females).

Stability of continuous scores was tested with Pearson r’s;
comparisons of type distributions among Negro and white subjects
were evaluated solely descriptively.


Results 


Test-retest reliability coefficients for continuous scores on the
four major variables (Table 1) indicate that measures of all
dimensions are statistically reliable. Table 1 also shows that the
only previous measure of the stability of this test’s continuous
scores (based upon retest of 41 Amherst freshmen after a 14-month
interval) may underestimate the stability of these scores. With
the exception of S-N scores for Howard vs. Amherst males, all
reliability coefficients were significantly (Fisher z, p < .01) greater
among Howard students. Whether these differences are based upon
the larger sample or upon the shorter retest interval cannot be
determined from the present data.
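
The Fisher z comparison mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' computation: it uses the T-F coefficients from Table 1 for Howard females (.82, N = 287) and Amherst males (.48, N = 41), treats the two samples as independent, and the function names are hypothetical.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher r-to-z transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    return (z1 - z2) / se


def two_tailed_p(z):
    """Two-tailed normal probability for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# T-F retest coefficients read from Table 1 (illustrative choice of pair).
z_stat = fisher_z_difference(0.82, 287, 0.48, 41)
p_value = two_tailed_p(z_stat)
print(round(z_stat, 2), p_value < 0.01)
```

The same function can be applied to any other pair of legible coefficients in Table 1; the outcome depends on which pair is chosen.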

Evidence of the stability of type-classifications, shown in Table
2, is presently unique in the literature, since no known investigations
have reported changes over time in the patterns of type
classifications. Table 2 indicates that for the majority of subjects,
the complete type-classification is stable over a 2-month interval,
while 88 per cent of subjects display either complete stability or a
shift in only one of the four basic variables. As might be inferred
from the essentially rectangular distribution of reliability scores


TABLE 1
Test-Retest Reliability Coefficients for MBTI Continuous Scores

          Howard Males(a)   Howard Females(a)   Amherst Males(b)
Scale        (N = 146)          (N = 287)           (N = 41)

E-I            .80                .83             [illegible]
S-N            .69             [illegible]        [illegible]
T-F            .73                .82                .48
J-P            .80                .82                .69

a 2-month retest interval.
b 14-month retest interval (Stricker and Ross, 1962, p. 102).


TABLE 2
Agreement in Original and Retest Type Category (N = 433)

Number of changes in type classification      N      %

0 scales                                     230     53
1 scale                                      150     35
2 scales                                      44     10
3 scales                                       9      2
4 scales                                       0      0

shown in Table 1, the shifts on the four scoring variables are ap- 
proximately equal in prevalence. 

With the stability of the Myers-Briggs Type Indicator measures 
established in the present analyses, the more important substantive 
issue of ethnic differences in type distributions is summarized in 
Tables 3, 4, and 5. It should be noted that the white undergraduate 
samples consisted of freshmen, while the Howard sample represents 
all class levels, with an emphasis upon lower-division students. 
Separate analyses for each sex, comparing white freshmen with a
subsample of Howard freshmen, gave results entirely consistent 
with the findings reported in Tables 3, 4, and 5. 

Type distributions for Negro male undergraduates, and comparisons
with standardization samples of white male undergraduates,
are shown in Table 3. Striking differences between the Negro
and white samples can be seen: approximately one-fourth of the
Howard male sample is categorized as ESTJ, as compared with
only 9.3 per cent of the “Ivy League” white male sample. The
diversity of types among white students (i.e., no type classification
characterizes 10 per cent of the white male undergraduates) is not
matched in the Howard sample, where nearly half of the subjects
are STJ's.


TABLE 3
Percentage Frequencies of the 16 Types among Negro(a) (N = 311) and
White(b) (N = 3676) Male College Students

            ISTJ     ISFJ     INFJ     INTJ
White        7.8      4.2      5.0      7.8
Howard      15.0      8.1      2.6      4.[illegible]

            ISTP     ISFP     INFP     INTP
White        3.3      2.8      8.0      1.8
Howard       3.6      1.6      3.8      4.8

            ESTP     ESFP     ENFP     ENTP
White        3.8      4.3      9.4      [illegible]
Howard       4.9      2.6      4.5      [illegible]

            ESTJ     ESFJ     ENFJ     ENTJ
White        9.3      5.9      5.8      7.5
Howard      24.5      7.5      5.2      5.5

a Entries for Howard Negro males are in the rows labelled Howard
  (italicized in the original).
b White males included liberal arts students from Amherst, Brown,
  Dartmouth, Stanford, and Wesleyan (Myers, 1962, Appendix D-5).

Comparable data for Howard Negro females, and comparisons 
with Pembroke white females, are presented in Table 4. Again, 
striking ethnic differences are found. Over one-fourth of the Negro 
women are SFJ's, while the major types found among Pembroke 
white females are NFP's. 

When Negro and white undergraduates of both sexes are com- 
pared in terms of the major scoring variables (Table 5), salient 
differences emerge with great clarity. Sex differences, while sig- 
nificant on several dimensions, are over-shadowed by ethnic differ- 
ences. Howard Negro subjects—males and females alike—are 
clearly more often Sensing and Judging types in comparison with 
their white undergraduate counterparts. Sex differences within both 
ethnic groups are in the directions one would expect from theory 
and from previous work with this instrument (Myers, 1962); 
males, in general, are more likely to be extraverted, sensing, think- 
ing, and judging, as compared with females. However, the Negro 
females are significantly more often sensing and judging types as 
compared with white males. Implications of this distinctive ethnic 
patterning are considered in a later section. 

TABLE 4
Percentage Frequencies of the 16 Types among Negro(a) (N = 447) and
White(b) (N = 240) Female College Students

            ISTJ     ISFJ     INFJ     INTJ
White        4.6      5.0      5.4      4.2
Howard      11.8     13.4      4.0      2.4

            ISTP     ISFP     INFP     INTP
White        1.7      2.5     11.8      7.6
Howard       4.9      6.7      8.5      [illegible]

            ESTP     ESFP     ENFP     ENTP
White        2.5      4.2     19.6      5.8
Howard       2.6      8.5      8.2      4.0

            ESTJ     ESFJ     ENFJ     ENTJ
White        2.5      7.6     10.4      5.4
Howard      10.0     12.7      6.0      2.4

a Entries for Howard Negro females are in the rows labelled Howard
  (italicized in the original).
b White females were Pembroke students (Myers, 1962, Appendix D-5).


TABLE 5
Type Classifications of Negro and White College Students
Proportion in each Type Category

Type             Negro Males   Negro Females   White Males(a)   White Females(a)
Classification    (N = 311)      (N = 447)       (N = 3676)        (N = 240)

E (E-I)             57.3           50.1             54.1             48.0
S (S-N)             67.8           62.2         [illegible]          30.6
T (T-F)             64.2           40.0             54.4             34.3
J (J-P)             73.3           62.7             52.3             45.1

a Data from Myers, 1962, Appendix D-5.


In an attempt to understand the type distributions found in the
Howard sample, effects of a number of background variables were
studied. Age, college-year, and academic major were clearly
unrelated to variables of the typology. Only two of the demographic
variables examined showed significant relationships to type
patterning. Among females only, later-borns (N = 295) were more likely
than first-borns (N = 107) to be Sensing (χ² = 4.6, p < .05) and
Judging (χ² = 6.1, p < .01) types. When father’s occupations were
classified as Unskilled (N = 240), Skilled (N = 332) or Professional
(N = 186), a clear ordering of subjects on the Sensation-Intuition
dimension was observed. Sensing types were most frequent among
children of unskilled workers (71%), next among children of skilled
workers (64%), and least frequent among children of professionals
(56%). The chi-square of 11.07 (2 df) is significant at the .01 level.
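
A chi-square of this kind can be approximated from the figures given in the text. The sketch below is an illustration only: it rebuilds the 3 x 2 table from the reported group sizes and the rounded percentages of Sensing types, so it is not expected to reproduce the published value of 11.07 exactly. The variable and function names are hypothetical.

```python
# Group sizes and rounded proportions of Sensing types, as reported in the text.
groups = {
    "Unskilled": (240, 0.71),
    "Skilled": (332, 0.64),
    "Professional": (186, 0.56),
}


def chi_square_from_proportions(groups):
    """Pearson chi-square for a k x 2 table rebuilt from sizes and proportions."""
    total_n = sum(n for n, _ in groups.values())
    pooled = sum(n * p for n, p in groups.values()) / total_n
    chi2 = 0.0
    for n, p in groups.values():
        # Sensing cell, then Intuitive cell, for this occupational group.
        for observed, expected in ((n * p, n * pooled),
                                   (n * (1 - p), n * (1 - pooled))):
            chi2 += (observed - expected) ** 2 / expected
    return chi2


chi2 = chi_square_from_proportions(groups)
CRITICAL_01_DF2 = 9.21  # tabled chi-square critical value, 2 df, .01 level
print(round(chi2, 2), chi2 > CRITICAL_01_DF2)
```

Because the percentages are rounded to whole numbers, the recomputed statistic falls somewhat below the published one while still exceeding the .01 critical value for 2 df.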
Discussion 

The central findings of this study—dramatic ethnic differences 
in the distribution of personality types, and the apparent preemp- 
tiveness of Sensing and Judging orientations among Negro college 
students—demand the most serious efforts toward interpreting 
their significance and implications. What, in terms of theory, does 
the Sensing-Judging orientation imply? How general is this phe- 
nomenon? What are the implications for personality dynamics, for 
etiology? How might these findings bear upon experiences of Negro 
students who will be attending college in vastly increased numbers? 

From the standpoint of Jungian theory, as reflected in the
rationale of the Myers-Briggs Type Indicator, personality types in
which sensing and judging orientations predominate are those for
whom “facts” are more potent than “possibilities,” and for whom
drawing conclusions is more important than “staying loose.” In
the context of a college sample, the predominance of sensing-judging
orientations suggests a degree of concreteness and need-for-closure
diametrically opposed to the imagination and openness
of the “idealized” liberal arts undergraduate. When one considers,
further, that the counterparts of sensing and judging—intuition
and perception—are regularly associated with academic achievement,
with creativity, with investment in intellectual and innovative
projects (as derived from studies of white samples—Myers,
1962), the problem of interpreting the predominance of sensing-judging
orientations among Negro college students becomes especially
demanding.

Interpretation brings up a vexing problem which recurs in many
comparative studies: how are race, ethnicity, and socioeconomic 
status to be distinguished as contributing to the findings? Is it even 
possible to do so? The traditional approach of “controlling” so- 
cioeconomic status by matching groups on indices of income, occu- 
pation, and education is of limited value in Negro-white compari- 
sons for several reasons. Because of historical and contemporary 
restrictions of opportunities for Negroes, traditional measures of 
socioeconomic status cannot describe comparable distributions 
among black and white samples. Moreover, even if it were possible 
to match samples on income, occupation, and education, the mean- 
ings of these variables could not be comparable in the context of 
such widely divergent backgrounds, prospects, and identifications. 
At least through the current generation of black students, ethnicity 
seems to be inextricably bound up not only with “race,” but also 
with derivatives of socioeconomic status. 

To what extent are demographic variables involved in the
distinctive ethnic differences found here? Although both student
groups attend private liberal arts colleges, several differences
between Howard University and “Ivy League” liberal arts students
should be examined as possibly contributing to the present findings.
The liberal arts colleges of the standardization studies are white
schools, and they are predominantly upper-middle-class “elite”
residential colleges, while Howard University is a black school,
predominantly lower-middle-class, and largely a “commuter
college.”

If lower-middle-class values and life styles are involved in the
distinctive type patterns found in the present study, the Howard
University sample should more closely approximate type distributions
found among vocationally oriented students and among
middle-class occupational groups. Some evidence derived from
Myers' (1962) studies supports this interpretation. Howard students
are closer to modal types of business students, public school teachers,
and (among males) military academy students than to Ivy
League liberal arts undergraduates. However, the social class
variable does not fully explain the over-emphasis upon sensing
and judging functions among our Negro college students. When the
Howard sample was compared with a sample of lower-division
white students in a California state college (cf. Carlson and Levy,
1968), highly significant differences (p < .001) in proportions
of SJ types were found. While the California “commuter college”
sample was predominantly lower-middle-class (white-collar and
skilled workers), the white students showed greater diversity of
type patterns, and greater incidence of intuitive and perceptive
types.

Toward exploration of the question—Why are there so few
intuitive and perceptive types among Negro students?—more fundamental
issues need to be engaged in future research. It seems likely
that the experience of living in a “majority”-dominated world,
where attention to immediate and concrete details is necessary for
both survival and achievement, would impose massive constraints
upon the development of “innate” preferences for intuitive,
perceptive modes of experience. When such constraints are embedded
in an entire subculture—as a vast sociological literature suggests—
the task of untangling individual predispositions, primary group
norms, and general social forces is a formidable one.

Do intuitive and perceptive functions remain relatively undeveloped
in the life situations of Negro students? Hedegard and
Brown (1969), reporting studies of Negro and white freshmen at
the University of Michigan, suggest this may be so: “. . . Negro
students indicated more often than white students that they had
been encouraged to deal with the world as a set of concrete, tangible
entities, rather than to develop abstract, intellectual ways of
conceptualizing their environment” (p. 135).

Or are intuitive-perceptive preferences concealed behind a
“mask” of socially appropriate responding on self-report inventories?
Hedegard and Brown (1969) specifically rejected a social-desirability
response set as accounting for ethnic differences found
among their college freshmen. However, a more subtle kind of
“response set” may be involved; a personality test may elicit a
stereotypic presentation of the self by Negro college students.
This possibility points to a substantive issue rather than to
methodological artifact, and must ultimately lead to a serious examination
of the formation and functioning of the self-image which mediates
response to personality tests.

Considerable evidence (Dreger and Miller, 1968; Erikson, 1966; 
Proshansky and Newton, 1968) suggests that the American Negro, 
in arriving at a self-image capable of integrating social realities, 
necessarily defines (and constricts) his self-concept in terms of 
“majority” definitions of personal effectiveness and worth. In the 
case of non-white Americans, such “majority” definitions have 
offered only a partial and distorted self-image in which elements 
of a “negative identity” (Erikson, 1966, p. 237) are emphasized to 
some degree. Thus many significant positive aspects of experience 
are likely to be excluded from the Negro’s conception of self. These 
could only emerge as valued and integrating qualities when 
radical transformations of identity (cf. Malcolm X, 1965; Cleaver, 
1968) become possible through opportunities for validating im- 
portant “excluded” aspects of the self. Our present interpretation 
may be an historically limited formulation, since it does not take 
into account current radical transformations of identity symbol- 
ized by “Black is Beautiful.” 

This interpretation of the predominance of sensing and judging
orientations among Negro students as a result of pervasive social
constraints may also illumine ethnic differences in sex-typing. For
the stereotypic “American”—extraverted, concrete, decisive,
objective, an ESTJ—is a “masculine” pattern. Males within both Negro
and white samples were more likely to be E, S, T, and J’s than were
females. However, in a portion of this pattern, Negro females
were “more so” than the white males—and Negro males “still more
so.” Such findings are consonant with other evidence (e.g.,
Carlson, 1969; Carlson and Levy, 1970; Lott and Lott, 1963)
suggesting that the “usual” patterns of sex-typing may not hold up
with non-white samples, or may, in fact, be found in exaggerated
form. Further, it appears that the practical, responsible, dominant
Negro woman portrayed in recent sociological literature can be
discerned in the type patterns of Negro college women. However,
the masculine counterpart—the frustrated, impulsive, “irresponsible”
Negro male of popular literature—is replaced among Negro
college males more often by the obverse: a cautious, rational,
emotionally constricted pattern in which the masculine stereotype
appears intensified.

Questions posed by the present normative findings suggest the
need for broader and deeper inquiry into problems of identity-formation
and self-image than is afforded by the present research
literature. This period of rapid social change, with new and vivid
self-consciousness among black students and citizens, and expanding
opportunities for “mainstream” participation of Negroes,
provides an important occasion for studying effects of social factors
on individual personality dynamics. To meet the challenge of this
research opportunity, some revision of research methods may well
be required. There is ample evidence of narrow, parochial bias in
current research methods. For example, the bulk of inquiry on
Negro personality has dealt with lower-class ghetto residents, and,
perhaps more significantly, has been defined in terms of “problems”
(whether “racial awareness,” “achievement,” “adjustment,”
etc.) seen from the standpoint of the majority-group researcher
(cf. Proshansky and Newton, 1968). Apart from issues of ethnic
differences, current research definitions of self-conception and ego
strength in implicitly evaluative terms (e.g., “competence,”
“achievement,” “inner-directedness”) neglect, or even obscure,
important qualitative aspects of self-conception (Carlson, 1970;
Gurin, et al., 1969).

Previous research (Carlson and Levy, 1968) suggests that variables
of Jungian typology, as assessed by the Myers-Briggs Type
Indicator, are significantly related to qualitative aspects of
self-conception in white students. Whether this instrument, with its
reliance upon verbalizations of self-conceptions, is fully adequate
in capturing variables of the Jungian typology among ethnic
minorities should be examined in future research. It seems possible,
for example, that such manifestations of intuitive and perceptive
modes as the rapid assessment of implications of social situations,
or the development of creative, informal “languages” and
responsiveness to such languages may be missed by self-report inventories
currently available. However, further development of Jung's
typological approach, and extension of assessment methods to include
other manifestations of basic Jungian attitudes and functions,
should enrich cross-ethnic studies of personality.

Beyond the substantive findings concerning Negro-white person- 
ality differences, the results of this study give considerable support 
for the use of the Myers-Briggs Type Indicator as a psycho- 
metrically stable instrument capable of reflecting important group 
differences. Present findings suggest that the dimensions of this 
instrument are more stable than indicated by previous research, 
and provide presently unique data suggesting that qualitative type 
designations are also remarkably stable. Further work directed 
toward extending construct validation of the Jungian typology
and of the Myers-Briggs Type Indicator would seem highly
promising and worthwhile.


A MULTITRAIT-MULTIMETHOD MODEL
FOR STUDYING GROWTH¹


CHARLES E. WERTS, KARL G. JÖRESKOG, AND ROBERT L. LINN
Educational Testing Service


Werts and Linn (1970a) have suggested that a multitrait-multimethod
approach (Campbell and Fiske, 1959) might be used
for studying growth. The purpose of this paper is to detail such a
model and to outline implications for the study of growth. The
major focus of our exposition will be the logic of this model rather
than the estimation of parameters or testing the fit of the model to
data. A comprehensive discussion of appropriate estimation and
fit-testing procedures may be found in Jöreskog (1970a), whose
general model for the analysis of covariance structures subsumes
the models used in this paper.


The Model 


The multitrait-multimethod approach may be treated as a problem
in confirmatory factor analysis (Jöreskog, 1970a, 1971). For
illustrative purposes we will consider the example of three traits
and three methods since this is the minimum number of traits and
methods required to produce unique (defined in Jöreskog, 1969,
pp. 185-186) parameter estimates, given the assumption that each
observed measure loads on only one trait and one method factor
and all factors are oblique. The general factor analytic model is:

\[ y = \mu + \Lambda T + e \tag{1} \]

where $y$ is the vector of observed scores,
      $\mu$ is the mean vector of $y$,
      $\Lambda$ is a matrix of factor loadings,
      $T$ is a vector of common factor scores, and
      $e$ is a vector of unique factor scores corresponding to specific
      factors and/or errors of measurement.

¹ The research reported herein was performed pursuant to Grant No. OEG-
2-100033(509) with the United States Department of Health, Education, and
Welfare and the Office of Education.



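For our example:

\[ y' = (y_{11}, y_{21}, y_{31}, y_{12}, y_{22}, y_{32}, y_{13}, y_{23}, y_{33}) \tag{1a} \]

where in $y_{ij}$, $i$ = method and $j$ = trait,

\[ T' = (T_1, T_2, T_3, M_1, M_2, M_3) \tag{1b} \]

where

$T_j$ = the $j$th trait factor,
$M_i$ = the $i$th method factor,

\[
\Lambda =
\begin{bmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} \\
0 & A_{12} & 0 & B_{12} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} \\
0 & 0 & A_{13} & B_{13} & 0 & 0 \\
0 & 0 & A_{23} & 0 & B_{23} & 0 \\
0 & 0 & A_{33} & 0 & 0 & B_{33}
\end{bmatrix}
\tag{1c}
\]
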


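where
$A_{ij}$ are loadings on trait factors and
$B_{ij}$ are loadings on method factors.
The expected variance-covariance matrix $\Sigma$ of $y$ is then given by

\[ \Sigma = \Lambda \Phi \Lambda' + \Theta^2 \tag{2} \]

where $\Theta^2$ is a diagonal matrix whose elements are the variances of $e$.
Since all factors are oblique, in our example:

\[
\Phi =
\begin{bmatrix}
V_{T_1} & & & & & \\
C_{T_1T_2} & V_{T_2} & & & \text{Symmetric} & \\
C_{T_1T_3} & C_{T_2T_3} & V_{T_3} & & & \\
C_{T_1M_1} & C_{T_2M_1} & C_{T_3M_1} & V_{M_1} & & \\
C_{T_1M_2} & C_{T_2M_2} & C_{T_3M_2} & C_{M_1M_2} & V_{M_2} & \\
C_{T_1M_3} & C_{T_2M_3} & C_{T_3M_3} & C_{M_1M_3} & C_{M_2M_3} & V_{M_3}
\end{bmatrix}
\tag{2a}
\]
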

where the C's are covariances and the V’s are variances. 
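
To make the structure of the model concrete, the following sketch builds the 9 x 6 loading matrix for three traits and three methods and evaluates the implied covariance matrix as loadings times factor covariances times transposed loadings, plus a diagonal of unique variances. All numerical values are arbitrary illustrations, the helper names are hypothetical, and the row order (all methods for trait 1, then trait 2, then trait 3) is an assumption about the layout of the loading matrix.

```python
N_TRAITS = N_METHODS = 3


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def build_loadings(a, b):
    """a[i][j], b[i][j]: trait and method loadings of measure y_ij
    (i = method, j = trait). Rows are ordered y11, y21, y31, y12, ..."""
    rows = []
    for j in range(N_TRAITS):
        for i in range(N_METHODS):
            row = [0.0] * (N_TRAITS + N_METHODS)
            row[j] = a[i][j]               # loading on trait factor T_j
            row[N_TRAITS + i] = b[i][j]    # loading on method factor M_i
            rows.append(row)
    return rows


def build_phi(trait_cov, method_cov, cross_cov):
    """Factor covariance matrix with unit variances (standardized factors)."""
    size = N_TRAITS + N_METHODS
    phi = [[1.0] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            if r == c:
                continue
            r_trait, c_trait = r < N_TRAITS, c < N_TRAITS
            if r_trait and c_trait:
                phi[r][c] = trait_cov
            elif not r_trait and not c_trait:
                phi[r][c] = method_cov
            else:
                phi[r][c] = cross_cov
    return phi


# Arbitrary illustrative parameter values (not estimates from any data).
a = [[0.7] * N_TRAITS for _ in range(N_METHODS)]
b = [[0.4] * N_TRAITS for _ in range(N_METHODS)]
lam = build_loadings(a, b)
phi = build_phi(trait_cov=0.5, method_cov=0.2, cross_cov=0.1)
unique_variance = 0.3

sigma = matmul(matmul(lam, phi), transpose(lam))
for k in range(len(sigma)):
    sigma[k][k] += unique_variance  # add the diagonal of unique variances

print(len(sigma), round(sigma[0][0], 3), round(sigma[0][1], 3))
```

With these values the variance of the first measure is 0.49 + 0.16 + 2(0.7)(0.4)(0.1) + 0.3 = 1.006, which shows how trait, method, trait-method, and unique components each contribute to an observed variance.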

Following Jöreskog (1970a), parameters will be labelled as one of
three kinds: (1) fixed parameters that have been assigned given
values; (2) constrained parameters that are unknown but equal to
one or more other parameters; and (3) free parameters that are
unknown and not constrained to be equal to any other parameter.
The term “identifiable” will be used in the sense defined by Fisher
(1966, p. 25): “we shall speak of that equation as identifiable (or
identified) if there exists some combination of prior and posterior
information which will enable us to distinguish its parameters from
those of any other equation in the same form.” For the models
studied in this paper, the term “identifiable” is synonymous with
the factor analyst’s term “unique solution,” i.e., a solution is “unique”
if all linear transformations of the factors that leave the fixed
parameters unchanged also leave the free parameters unchanged. As
Jöreskog (1970b) notes: “Before an attempt is made to estimate
a model of this kind, the identification problem must be examined.”
The number of overidentifying restrictions on the model is frequently
of interest; for example, after standardizing factor variances (i.e.,
$V_{T_j} = V_{M_i} = 1$) the three method by three trait model has three
overidentifying restrictions, i.e., $\Sigma$ has 45 distinct variances and
covariances as compared to 42 free parameters to be estimated
(18 factor loadings, 15 factor covariances in $\Phi$, and nine residual
variances in $\Theta^2$). The number of overidentifying restrictions is the
degrees of freedom (df) for the test statistic in Jöreskog’s general
model (1970a, p. 241, sec. 1.4). The “path analysis” approach used
by Werts and Linn (1970a) can be very useful in exploring the
identification question in overidentified models. However, as noted
by Hauser and Goldberger (1970) the “path analysis” literature
does not adequately deal with the estimation problem in
overidentified models, in part because the sample-population distinction
is blurred.
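
The parameter count in the preceding paragraph can be checked with a few lines of arithmetic. The sketch below is only a counting aid with hypothetical function names: it assumes standardized factor variances and one trait loading and one method loading per measure, as in the text, and a non-negative result is a necessary condition for identification, not a sufficient one.

```python
def distinct_moments(n_measures):
    """Number of distinct variances and covariances among the measures."""
    return n_measures * (n_measures + 1) // 2


def overidentifying_restrictions(n_traits, n_methods):
    n_measures = n_traits * n_methods
    n_factors = n_traits + n_methods
    loadings = 2 * n_measures                             # one trait, one method loading each
    factor_covariances = n_factors * (n_factors - 1) // 2  # factor variances fixed at 1
    residual_variances = n_measures
    free = loadings + factor_covariances + residual_variances
    return distinct_moments(n_measures) - free


print(overidentifying_restrictions(3, 3))  # three traits by three methods
print(overidentifying_restrictions(2, 3))  # fewer traits: more parameters than moments
```

For three traits and three methods this gives 45 - 42 = 3, the figure in the text; with only two traits or two methods the count is negative, which is consistent with the statement that three of each is the minimum.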
The multitrait-multimethod approach considered above does not
consider any functional relationships among the trait factors,
i.e., the approach deals only with errors of measurement. In the
study of growth, these trait factors correspond to initial status,
final status, and the determinants of growth, and a structural model
showing the relationship among these variables must be specified.
Substantive inferences about growth are based on estimates of the
parameters of the structural model.
Suppose that the structural model for growth took the form:

\[ T_3 = D_1 T_1 + D_2 T_2 + \varepsilon \tag{3} \]

where $T_3$ is the final status, $T_2$ is the initial status, and $T_1$ is a
determinant of growth; all other influences on growth (represented
by $\varepsilon$) being independent of $T_1$ and $T_2$. In this model the initial
status $T_2$ may influence the rate of growth. The parameters of
equation (3) are just identifiable in terms of the elements of $\Phi$, i.e.,
the number of restrictions on the overall model is not changed.
Assuming that $T_2$ and $T_3$ are measurements on the same dimension,
as implied by the terms “initial” and “final” status, growth ($\Delta$) is
equal to $T_3 - T_2$. Werts and Linn (1970b) have shown that the
regression weights for $T_1$ and $T_2$ are:


\[ D_1 = D_{\Delta T_1 \cdot T_2} \tag{4} \]

and

\[ D_2 = 1 + D_{\Delta T_2 \cdot T_1} \tag{5} \]

where $D_{\Delta T_1 \cdot T_2}$ is the regression weight of $\Delta$ on $T_1$ with $T_2$ controlled
and $D_{\Delta T_2 \cdot T_1}$ is the regression weight of $\Delta$ on $T_2$ with $T_1$ controlled.
In other words $D_1$ represents the direct influence of $T_1$ on growth and
$D_2$ represents the direct influence of initial status on growth plus
unity (which represents that part of $T_3$ which is initial status).
Since $T_3 = \Delta + T_2$, substituting equations (4) and (5) into (3)
yields:

\[ \Delta = D_{\Delta T_1 \cdot T_2} T_1 + D_{\Delta T_2 \cdot T_1} T_2 + \varepsilon \tag{6} \]
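
The relation between the weights for final status and the weights for growth can be verified numerically from covariances alone. The sketch below is an illustration with arbitrary values: it assumes a covariance matrix for the determinant and initial status, generates the covariances of final status from chosen weights, and then solves the normal equations for the regression of growth (final minus initial status) on the two predictors. The helper name is hypothetical.

```python
def solve_normal_equations(v1, c12, v2, c1y, c2y):
    """Regression weights of y on two predictors from variances and covariances."""
    det = v1 * v2 - c12 * c12
    w1 = (c1y * v2 - c12 * c2y) / det
    w2 = (v1 * c2y - c12 * c1y) / det
    return w1, w2


# Arbitrary illustrative values (not estimates from any data).
v_t1, v_t2, c_t1t2 = 1.0, 1.0, 0.5   # variances and covariance of T1 and T2
d1, d2 = 0.3, 0.9                    # weights of T1 and T2 in the model for T3

# Covariances of T3 = d1*T1 + d2*T2 + residual with each predictor
# (the residual is independent of T1 and T2).
c_t1t3 = d1 * v_t1 + d2 * c_t1t2
c_t2t3 = d1 * c_t1t2 + d2 * v_t2

# Covariances of growth (T3 - T2) with each predictor.
c_t1_growth = c_t1t3 - c_t1t2
c_t2_growth = c_t2t3 - v_t2

w_t1, w_t2 = solve_normal_equations(v_t1, c_t1t2, v_t2, c_t1_growth, c_t2_growth)
print(round(w_t1, 6), round(w_t2, 6))
```

The weight of the determinant is unchanged (0.3), while the weight of initial status drops by exactly one (0.9 - 1 = -0.1), as the two regression-weight equations state.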


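In terms of $T_1$, $T_2$ and $\varepsilon$, equations (1b), (1c), and (2a) become:

\[ T^{*\prime} = (T_1, T_2, \varepsilon, M_1, M_2, M_3), \tag{7a} \]

\[
\Lambda^* =
\begin{bmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} \\
0 & A_{12} & 0 & B_{12} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} \\
A_{13}D_1 & A_{13}D_2 & A_{13} & B_{13} & 0 & 0 \\
A_{23}D_1 & A_{23}D_2 & A_{23} & 0 & B_{23} & 0 \\
A_{33}D_1 & A_{33}D_2 & A_{33} & 0 & 0 & B_{33}
\end{bmatrix}
\tag{7b}
\]

and

\[
\Phi^* =
\begin{bmatrix}
1 & & & & & \\
C_{T_1T_2} & 1 & & & \text{Symmetric} & \\
0 & 0 & V_{\varepsilon} & & & \\
C_{T_1M_1} & C_{T_2M_1} & C_{\varepsilon M_1} & 1 & & \\
C_{T_1M_2} & C_{T_2M_2} & C_{\varepsilon M_2} & C_{M_1M_2} & 1 & \\
C_{T_1M_3} & C_{T_2M_3} & C_{\varepsilon M_3} & C_{M_1M_3} & C_{M_2M_3} & 1
\end{bmatrix}
\tag{7c}
\]
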

respectively. If the analyst wished to scale a factor by the unit of a
particular measure this may be accomplished by setting the $\Lambda$
slope for the measure equal to unity (in which case the variance of
the corresponding factor should not be standardized but left free
to be estimated by the program). The assumption that $T_2$ and $T_3$
are measures on the same dimension is equivalent to setting the
same method regression weights equal, i.e., in our example $A_{12} = A_{13}$,
$A_{22} = A_{23}$, and $A_{32} = A_{33}$. As detailed by Werts and Linn (1970a)
the effect of these restrictions is that the ratio of the variance of
$T_3$ to $T_2$ is fixed. For estimation purposes it is convenient to
standardize all factors except $T_3$ whose variance is fixed in relation to
$T_2$. The model defined by equations (7a), (7b), and (7c) is no longer
a simple factor analysis model, but may be estimated using Jöreskog’s
(1970a) general model for the analysis of covariance structures.
For this purpose $\Lambda^*$ may be rewritten as the product of two matrices:

\[ \Lambda^* = B \Lambda^{**} \tag{7d} \]

where $B$ is the 9 × 9 diagonal matrix

\[ B = \operatorname{diag}(1, 1, 1, 1, 1, 1, A_{13}, A_{23}, A_{33}) \]

and

\[
\Lambda^{**} =
\begin{bmatrix}
A_{11} & 0 & 0 & B_{11} & 0 & 0 \\
A_{21} & 0 & 0 & 0 & B_{21} & 0 \\
A_{31} & 0 & 0 & 0 & 0 & B_{31} \\
0 & A_{12} & 0 & B_{12} & 0 & 0 \\
0 & A_{22} & 0 & 0 & B_{22} & 0 \\
0 & A_{32} & 0 & 0 & 0 & B_{32} \\
D_1 & D_2 & 1 & x_1 & 0 & 0 \\
D_1 & D_2 & 1 & 0 & x_2 & 0 \\
D_1 & D_2 & 1 & 0 & 0 & x_3
\end{bmatrix}
\]

and $x_1 = B_{13}/A_{13}$, $x_2 = B_{23}/A_{23}$, $x_3 = B_{33}/A_{33}$. By substitution:

\[ \Sigma = B \Lambda^{**} \Phi^* \Lambda^{**\prime} B' + \Theta^2, \]

which is a special case of Jöreskog's (1970a) general model.

In using the computer program (Jöreskog, Gruvaeus, and van
Thillo, 1970) the parameters $A_{12}$, $A_{22}$, $A_{32}$ in $\Lambda^{**}$ should be
constrained to be equal to $A_{13}$, $A_{23}$, and $A_{33}$ respectively in $B$. The
resulting model has 45 distinct variances and covariances in $\Sigma$ and
40 free and constrained parameters (17 in $\Lambda^{**}$, 14 in $\Phi^*$, 9 in $\Theta^2$,
none in $B$ because of equality restraints) which means that the
model has five overidentifying restrictions (df). The advantage of
casting the analysis in terms of Jöreskog’s general model is that,
given the assumption that the observed variables are distributed
normally, various hypotheses about the model may be tested in
large samples. In particular, we may wonder if trait factors are
uncorrelated with methods factors and methods factors with each
other as assumed by Cronbach and Furby (1970) and Werts and
Linn (1970a) in their analysis of growth. To make this test, the
analysis would be run with the model of (1a), (1b), (1c), and
(2a) with $V_{T_1} = V_{T_2} = V_{T_3} = V_{M_1} = V_{M_2} = V_{M_3} = 1$ and then the
analysis would be made with $C_{T_1M_1} = C_{T_1M_2} = C_{T_1M_3} = C_{T_2M_1} =
C_{T_2M_2} = C_{T_2M_3} = C_{T_3M_1} = C_{T_3M_2} = C_{T_3M_3} = C_{M_1M_2} = C_{M_1M_3} =
C_{M_2M_3} = 0$. For our example, the initial analysis would yield a
chi-square with three df for testing the fit of the model to the data.
The second analysis would yield a chi-square with 15 df since 12
additional restrictions have been made. The increase in chi-square
with 12 df is a test of the tenability of the additional restrictions.
Starting with the same initial model, the tenability of assuming that
$A_{12} = A_{13}$, $A_{22} = A_{23}$, and $A_{32} = A_{33}$ may be tested (dropping the
$V_{T_3} = 1$ assumption) using the increase in chi-square with 2 df.
Likewise starting with these assumptions (i.e., equations (7a), (7b),
and (7c), and df = 5) hypotheses about growth can be tested, e.g.,
$D_1$ can be set equal to zero and the resulting change in $\chi^2$ (df = 1)
is a test of whether $T_1$ directly influences growth. To test whether
initial status directly influences growth (i.e., whether $D_{\Delta T_2 \cdot T_1} = 0$),
$D_2$ would be set equal to unity (see equation (5)), the increase in
$\chi^2$ (df = 1) testing this hypothesis. The fit of the observed variance-covariance
matrix $S$ to the estimated elements of $\Sigma$ may be used to
form some judgment as to changes in fit resulting from additional
restrictions, especially when the $\chi^2$ test is inappropriate because
the assumption of multivariate normality is not reasonable.

As originally conceived by Campbell and Fiske (1959) the
multitrait-multimethod approach required each trait to be measured
with each method, as in the example analyzed above. The
linear structural model approach proposed herein requires that
model parameters be identifiable, a question which is unrelated to
whether each trait is measured with each method. In order to fix
the ratio of the variance of the final status to the initial status
factor, only one pair of initial and final measures with the same
units of measurement is required, i.e., the three sets of initial-final
measures in our example serve to overidentify this variance
ratio. The identification problem would be greatly simplified if
[...] of these same method sets were replaced with different method
measures, even though the resulting matrix would no longer be in
the form required by Campbell and Fiske. Campbell and Fiske's
argument that different method measures of a trait are required to
improve convergent validity appears fundamentally sound and is a
basic premise in our analysis. We have abandoned the particular
type of analysis used by Campbell and Fiske because it fails to
specify the underlying structure being postulated, and does not
allow for nonsymmetrical method-by-trait combinations.


Relationship to Classical Test Theory 


The multitrait-multimethod formulation can be shown to include
various procedures derived from classical test theory as special
cases, e.g., the commonly used formulas for reliability of differences,
correlation of true initial status with true gain, and the correlation
of true scores over time can be derived from the multitrait-multimethod
model by imposing specifiable restrictions. To illustrate
this point we shall examine the case of two parallel measures ($y_{12}$,
$y_{22}$) given initially and two finally ($y_{13}$, $y_{23}$). First let us consider
the analysis given the traditional assumptions that all errors of
measurement are independent of each other and of the true scores.
In our formulation this is equivalent to asserting that there are no
methods factors. Without further assumptions the model may be
represented in terms of equation (1) as

\[ y' = (y_{12}, y_{22}, y_{13}, y_{23}), \tag{8a} \]

\[ T' = (T_2, T_3), \tag{8b} \]

\[
\Lambda =
\begin{bmatrix}
A_{12} & 0 \\
A_{22} & 0 \\
0 & A_{13} \\
0 & A_{23}
\end{bmatrix},
\tag{8c}
\]

\[
\Phi =
\begin{bmatrix}
V_{T_2} & \\
C_{T_2T_3} & V_{T_3}
\end{bmatrix},
\tag{8d}
\]

and

\[ \Theta^2 = \operatorname{diag}(V_{e_{12}}, V_{e_{22}}, V_{e_{13}}, V_{e_{23}}). \tag{8e} \]



Assuming that initial and final status are on the same scale, "parallel"
test assumptions are equivalent to (Jöreskog, 1971) fixing $A_{11} =
A_{21} = A_{12} = A_{22} = 1$ and constraining $V_{e_{11}} = V_{e_{21}}$ and $V_{e_{12}} = V_{e_{22}}$.
All parameters are identifiable and df = 5. Identification still occurs
without the error variance assumptions (df = 3), i.e., in true score
lexicon, "essentially tau-equivalent" measures (Lord and Novick,
1968, pp. 47-50) would suffice. If we choose to use nonparallel or
"congeneric" (Jöreskog, 1971) measures, one pair of measures over
time being on the same scale (e.g., $A_{11} = A_{12}$), $V_{T_1}$ could be arbitrarily
standardized (= 1), yielding an identifiable model with
df = 1. In all these cases, growth statistics may be obtained from
the parameter estimates or the model can be transformed to obtain
growth statistics directly. Inserting $T_2 = T_1 + \Delta$ then:


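\[ T^{*\prime} = (T_1, \Delta), \tag{9a} \]

\[ \Lambda^* = \begin{bmatrix} A_{11} & 0 \\ A_{21} & 0 \\ A_{12} & A_{12} \\ A_{22} & A_{22} \end{bmatrix}, \tag{9b} \]

where $A_{11} = A_{12}$ by assumption, and

\[ \Phi^* = \begin{bmatrix} V_{T_1} & \\ C_{T_1\Delta} & V_{\Delta} \end{bmatrix}, \tag{9c} \]

where $V_{T_1} = 1$ for convenience.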
Relevant growth statistics are: 
Вала = correlation of initial status with gain = ба V ie 
а; 
Di. c Ope E Р (10b) 
(100) 


== 


| Vr = У, + Vat 20243 
and
Drs = 1+ i. 


model of equations (8a), (8b), (8c), (8d), and (8e), growth statistics
can be obtained by:


Dr, = дах, ts Vz, 
Dar = Бул, 5 

Va = ee Gh Vr, E 26r. 
(UN - Dari Ра, 


Bra = Dar, ۷ Pri + Pa. 
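The growth quantities can also be computed numerically from the elements of the factor covariance matrix in (8d). The following is a minimal Python sketch that uses only the identity $\Delta = T_2 - T_1$; the function name, the dictionary keys, and the input values are illustrative assumptions, not estimates from any data set.

```python
import math


def growth_statistics(v_t1, v_t2, c_t1t2):
    """Growth quantities implied by Delta = T2 - T1.

    v_t1, v_t2 : true-score variances at the initial and final occasions
    c_t1t2     : covariance of the initial and final true scores
    """
    v_gain = v_t1 + v_t2 - 2.0 * c_t1t2      # variance of true gain
    c_t1_gain = c_t1t2 - v_t1                # covariance of initial status with gain
    return {
        "v_gain": v_gain,
        "c_t1_gain": c_t1_gain,
        "rho_t1_gain": c_t1_gain / math.sqrt(v_t1 * v_gain),
        "rho_t1t2": c_t1t2 / math.sqrt(v_t1 * v_t2),
        "b_gain_on_t1": c_t1_gain / v_t1,    # regression of gain on initial status
    }


# Hypothetical values: V_T1 = 1.0, V_T2 = 1.5, C_T1T2 = 1.2
STATS = growth_statistics(1.0, 1.5, 1.2)
```

As a consistency check, the returned values satisfy $V_{T_2} = V_{T_1} + V_{\Delta} + 2C_{T_1\Delta}$, so the transformed model of (9a) through (9c) and the original model describe the same covariance structure.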


Following Jöreskog (1971) the parallel test assumption can be tested
(given multivariate normality) by comparing the chi-square for the
"essentially tau-equivalent" model to that for the "parallel" test
model; the difference in chi-square with df = 2 is a test of the
assumptions that $V_{e_{11}} = V_{e_{21}}$ and $V_{e_{12}} = V_{e_{22}}$. Similarly the increase in
chi-square from the "congeneric" model to the "essentially tau-equivalent"
model (df = 2) is a test of the assumptions that $A_{11} =
A_{21}$ and $A_{12} = A_{22}$. If the parallel test assumptions are accepted
then the population reliability at the initial time may be estimated
by $\hat V_{T_1} \div (\hat V_{T_1} + \hat V_{e_{11}})$ and reliability at the final time by $\hat V_{T_2} \div
(\hat V_{T_2} + \hat V_{e_{12}})$. The reliability for each test is the square of the corresponding
standardized factor loading in the case of "essentially
tau-equivalent" or "congeneric" measures. Another statistic of
interest in the traditional psychometric literature is the reliability
of differences ($\rho_d$), which is defined as the true variance of the differences
divided by the variance of the observed differences. In the
parallel case the estimated population error variances can be used
to obtain $\hat\rho_d$ directly:


and 


I Y. 
^ EX тим 


With "essentially tau-equivalent" assumptions no statement is
made about equality of error variances so that four reliabilities may
be estimated:


gj - RAD 
Vat Y. ЕТ...’ 


LI 
Va


pa” = и ИЕ x 7. a7 7_ 5 (120) 


91 = T 
Ba = y У, + Vow sigs 


(12e) 


81? = „осо ас ET 

Vat Ӯ... + Ў... 
Formulas (12a), (12b), (12c), (12d), and (12e) are based on the
assumption that the true scores have the same units as the observed
scores, which is not true in the case of congeneric measures. Since
the regression of observed on true differences is equal to the regression
of observed on true scores (Werts and Linn, 1970a, equation
[25]) it is only necessary to standardize this weight with the appropriate
variances to obtain the reliability of differences for all
cases, e.g., in the congeneric case if $A_{11} = A_{12}$ then


\[ \hat\rho_d = \frac{\hat A_{11}^2 \hat V_{\Delta}}{\hat V_{y_{11}} + \hat V_{y_{12}} - 2\hat C(y_{11}, y_{12})}, \tag{12f} \]

where $\hat V_{y_{11}}$, $\hat V_{y_{12}}$, and $\hat C(y_{11}, y_{12})$ are the estimated elements in $\hat\Sigma$.
This formula uses estimated elements in $\hat\Sigma$ which are provided in
the computer output for Jöreskog's program (Jöreskog, Gruvaeus,
and van Thillo, 1970). The program computes the elements in $\hat\Sigma$
from the estimates for the underlying parameters, e.g., $\hat C(y_{11}, y_{12}) =
\hat A_{11}\hat A_{12}\hat C_{T_1T_2}$. This model (all measurement errors independent) may
be used to clarify traditional procedures for obtaining growth statistics.
For example, consider the case in which one initial and one
final test is given. A common procedure is to obtain split half reliabilities
at each time and use these to correct for attenuation. If
$y_{11}$ and $y_{21}$ are the initial split halves and $y_{12}$ and $y_{22}$ the final split
halves, this case corresponds exactly to the parallel measure case
analyzed above. The difference from the traditional procedure is
that the complete variance-covariance matrix for the split halves
is computed and used in the analysis. As shown above, the "parallel"
and "essentially tau-equivalent" assumptions can be tested against
the congeneric model and the congeneric model is overidentified.
From this perspective the traditional procedure neglects useful
information about correlations among split halves and thereby
loses the possibility of rejecting the model because of poor fit to the
data and of analyzing the data making only congeneric test assumptions.
To understand the connection with the traditional
formula it is of interest to standardize $\hat\Sigma$ into a correlation matrix
(correlations generated by the model are indicated by the symbol $\hat\rho$)
and to show the relationships to standardized model parameters
(denoted by asterisk):


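\[ \hat\rho(y_{11}, y_{12}) = \hat A_{11}^{*}\,\hat\rho_{T_1T_2}\,\hat A_{12}^{*} \tag{13a} \]
\[ \hat\rho(y_{11}, y_{22}) = \hat A_{11}^{*}\,\hat\rho_{T_1T_2}\,\hat A_{22}^{*} \tag{13b} \]
\[ \hat\rho(y_{21}, y_{12}) = \hat A_{21}^{*}\,\hat\rho_{T_1T_2}\,\hat A_{12}^{*} \tag{13c} \]
\[ \hat\rho(y_{21}, y_{22}) = \hat A_{21}^{*}\,\hat\rho_{T_1T_2}\,\hat A_{22}^{*} \tag{13d} \]
\[ \hat\rho(y_{11}, y_{21}) = \hat A_{11}^{*}\hat A_{21}^{*} \tag{13e} \]
\[ \hat\rho(y_{12}, y_{22}) = \hat A_{12}^{*}\hat A_{22}^{*}. \tag{13f} \]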


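If parallel test assumptions are valid then $\hat A_{11}^{*} = \hat A_{21}^{*}$ and $\hat A_{12}^{*} =
\hat A_{22}^{*}$, in which case equations (13a), (13b), (13c), and (13d) are
identical and should be recognized as the traditional correction for
attenuation, except that the correlations are drawn from $\hat\Sigma$ rather
than from the observed correlation matrix S. Equations (13e) and
(13f), under parallel test assumptions, are simply the assumption
that the reliability defined as the squared correlation (i.e., $\hat A_{11}^{*2}$
or $\hat A_{12}^{*2}$) of the observed with the true score is equal to the correlation
between two parallel tests, but again the correlations are
drawn from $\hat\Sigma$, not from S. What these equations show is that it is
not necessary for the reliabilities of the split halves to be equal in
order to identify the unattenuated correlation $\hat\rho_{T_1T_2}$ given uncorrelated
errors. If the estimates of the elements in $\hat\Sigma$ for the parallel
case are examined it will be found that because of the structural
specifications: $\hat V_{y_{11}} = \hat V_{y_{21}}$, $\hat V_{y_{12}} = \hat V_{y_{22}}$, $\hat C(y_{11}, y_{21}) = \hat V_{T_1}$,
$\hat C(y_{12}, y_{22}) = \hat V_{T_2}$, and $\hat C(y_{11}, y_{12}) = \hat C(y_{11}, y_{22}) = \hat C(y_{21}, y_{12}) =
\hat C(y_{21}, y_{22}) = \hat C_{T_1T_2}$. Translating the equation for the
reliability of differences into the elements of $\hat\Sigma$:

\[ \hat\rho_d = \frac{\hat C(y_{11}, y_{21}) + \hat C(y_{12}, y_{22}) - 2\hat C(y_{11}, y_{12})}{\hat V_{y_{11}} + \hat V_{y_{12}} - 2\hat C(y_{11}, y_{12})} \tag{14a} \]

or

\[ \hat\rho_d = \frac{\hat V_{y_{11}}\hat\rho(y_{11}, y_{21}) + \hat V_{y_{12}}\hat\rho(y_{12}, y_{22}) - 2\hat\rho(y_{11}, y_{12})\sqrt{\hat V_{y_{11}}\hat V_{y_{12}}}}{\hat V_{y_{11}} + \hat V_{y_{12}} - 2\hat\rho(y_{11}, y_{12})\sqrt{\hat V_{y_{11}}\hat V_{y_{12}}}}. \tag{14b} \]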
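The agreement between the covariance form (14a), the correlation form (14b), and the definition of the reliability of differences (true variance of the differences over observed variance of the differences) can be checked numerically for the parallel case. The Python sketch below assumes hypothetical parameter values and hypothetical function names; it is a check on the algebra, not a reproduction of any published computation.

```python
import math


def reliability_from_covariances(c_initial_pair, c_final_pair, c_over_time,
                                 v_initial, v_final):
    """Covariance form: elements of the model-implied covariance matrix."""
    numerator = c_initial_pair + c_final_pair - 2.0 * c_over_time
    denominator = v_initial + v_final - 2.0 * c_over_time
    return numerator / denominator


def reliability_from_correlations(v_initial, v_final, r_initial_pair,
                                  r_final_pair, r_over_time):
    """Correlation form: the traditional reliability-of-differences formula."""
    scale = math.sqrt(v_initial * v_final)
    numerator = (v_initial * r_initial_pair + v_final * r_final_pair
                 - 2.0 * r_over_time * scale)
    denominator = v_initial + v_final - 2.0 * r_over_time * scale
    return numerator / denominator


# Hypothetical parallel-model parameters.
V_T1, V_T2, C_T1T2 = 1.0, 1.5, 1.2   # true variances and covariance
V_E1, V_E2 = 0.5, 0.6                # error variances at each occasion

V_Y11 = V_T1 + V_E1                  # implied observed variance, initial
V_Y12 = V_T2 + V_E2                  # implied observed variance, final

REL_COV = reliability_from_covariances(V_T1, V_T2, C_T1T2, V_Y11, V_Y12)
REL_CORR = reliability_from_correlations(
    V_Y11, V_Y12,
    V_T1 / V_Y11,                          # correlation of the two initial measures
    V_T2 / V_Y12,                          # correlation of the two final measures
    C_T1T2 / math.sqrt(V_Y11 * V_Y12),     # correlation over time
)
V_GAIN = V_T1 + V_T2 - 2.0 * C_T1T2
REL_DEFINITION = V_GAIN / (V_GAIN + V_E1 + V_E2)
```

All three quantities coincide for these inputs, which illustrates that the two forms are restatements of the same ratio once the correlations are taken from the model-implied matrix.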
Equation (14b) should be recognized as the traditional formula for
the reliability of differences, noting however that the estimates
are drawn from $\hat\Sigma$, not from the observed matrix S. The essentially
tau-equivalent case differs from the parallel case in that the corresponding
variances in $\hat\Sigma$ are not required to be equal; however,
the covariances between independent measures of different traits
are still equal to the covariances between the corresponding trait
factors. This means that formula (14a) could be used for any pair of
tau-equivalent tests over time. For congeneric measures the formula
involves the pairs of measures which have the same units over time,
e.g., if $A_{11} = A_{12}$ then equation (12f) may be translated into

\[ \hat\rho_d = \frac{\hat V_{y_{11}}(\hat A_{11}^{*})^2 + \hat V_{y_{12}}(\hat A_{12}^{*})^2 - 2\hat A_{11}^{*}\hat A_{12}^{*}\hat\rho_{T_1T_2}\sqrt{\hat V_{y_{11}}\hat V_{y_{12}}}}{\hat V_{y_{11}} + \hat V_{y_{12}} - 2\hat\rho(y_{11}, y_{12})\sqrt{\hat V_{y_{11}}\hat V_{y_{12}}}}. \tag{14c} \]


Equation (14c) is the reliability of differences formula given by
Werts and Linn (1970a, equation [26]) for the case of correlated
errors over time for the pair of measurements on the same scale,
i.e., the Werts and Linn formula is also appropriate to the independent
error case when applied to the elements of $\hat\Sigma$ rather than S. If formula
(14c) applies to correlated errors using congeneric measures then it
may be specialized for the parallel measures case, e.g., if $y_{11}$ and $y_{12}$
have nonindependent errors and $y_{21}$ and $y_{22}$ have independent errors:
(а) A,,* = Ала“, by parallel test assumptions, therefore Ала Ала втат, 


в = 


за Au Ass rri 
(b) but Pus,» = Aut Án Prr 
Since 


Ла“ = V Bios Vas 
Ала“ = У Ais, уз), 7... = Y55 f... T Po 


equation (14c) becomes 


ћ = P, Bs, Yoo) + T's (ıs, Уз) — 20012, Yas) V Рођа, (14d) 


LESS 7... — 2202, ths) У Каћа 


wae 
Equation (14d) is the formula for the reliability of differences for
"linked" (i.e., correlated errors) parallel test measures given by
Cronbach and Furby (1970, equation [6]), which can be seen to be
the parallel measure specialization of the Werts-Linn equation for
nonindependent congeneric measures. Similarly from equations
(11a), (11b), (11c), (11d), and (11e) it follows that the estimated
correlation of status with gain is:


б. Ў, 
а = . 1 
te Бү жит улс артур $ 


In the congeneric case with $A_{11} = A_{12}$, this may be transformed
into
m Aut VT, АУТ 
Aw" Pein st: Au" f, d 20s, v As! Ass" V T, АИ 4 
(15b 
Formula (15b) is the correlation of status with gain given by Werts
and Linn (1970a, equation [28]) for the case of congeneric measures
and correlated errors, i.e., the formula applies also to the independent
error case. In the case of parallel independent measures $\hat\rho_{T_1T_2} =
\hat\rho(y_{11}, y_{12}) \div \sqrt{\hat\rho(y_{11}, y_{21})\,\hat\rho(y_{12}, y_{22})}$, which when substituted into
formula (15b) yields the traditional formula for the correlation of
status with gain as applied to the elements of $\hat\Sigma$:


In this paragraph we propose to use our model to specify the conditions
implicit in Cronbach's (1960, pp. 136-139) discussion of
coefficients of "stability" and "equivalence." Cronbach uses an
example in which two forms of the Mechanical Reasoning Test of
the DAT were used, the same forms being used for test and retest
purposes. When the same form is repeated, the test-retest correlation
is higher than the test-retest correlation between different
forms, suggesting the presence of "long-lasting test-specific" factors.
The implication is that the errors of measurement for the same
test repeated are not independent. Assuming that both forms were
repeated and errors of measurement independent for different forms,
the model for parallel measures is of the form:


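\[ y = \mu + \Lambda T, \tag{16a} \]

where

\[ y' = (y_{11}, y_{21}, y_{12}, y_{22}), \tag{16b} \]

where $y_{11}$ and $y_{12}$ are the same test as are $y_{21}$ and $y_{22}$,

\[ T' = (T_1, T_2, e_{11}, e_{21}, e_{12}, e_{22}), \tag{16c} \]

\[ \Lambda = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}, \tag{16d} \]

and

\[ \Phi = \begin{bmatrix} V_{T_1} & & & & & \\ C_{T_1T_2} & V_{T_2} & & \text{Symmetric} & & \\ 0 & 0 & V_{e_{11}} & & & \\ 0 & 0 & 0 & V_{e_{21}} & & \\ 0 & 0 & C_{e_{11}e_{12}} & 0 & V_{e_{12}} & \\ 0 & 0 & 0 & C_{e_{21}e_{22}} & 0 & V_{e_{22}} \end{bmatrix}, \tag{16e} \]

where $V_{e_{11}} = V_{e_{21}}$, $V_{e_{12}} = V_{e_{22}}$.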

The model of (16a) is the special case of factor analysis in which 
the residual factors are treated as latent factors. of ® 
бето that the sna tash СС ant 

„Ср and Сри O: АП parameters are identifiable m 
(10 distinct elements in $\Sigma$ less 7 free and constrained parameters).
Essentially tau-equivalent assumptions would still have provided
identification but with only one overidentifying restriction (since
$V_{e_{11}} \neq V_{e_{21}}$, $V_{e_{12}} \neq V_{e_{22}}$). An interesting case occurs with congeneric
assumptions in which case the model is underidentified; however,
the unattenuated trait correlation $\rho_{T_1T_2}$ is just identified [$\hat\rho_{T_1T_2}^{2} =
\hat C(y_{11}, y_{22})\,\hat C(y_{21}, y_{12}) \div (\hat C(y_{11}, y_{21})\,\hat C(y_{12}, y_{22}))$]. Identification may be
achieved with the congeneric model by repeating only one test
(assuming $A_{11} = A_{12}$) and using different method measures for $y_{21}$
and $y_{22}$ in which case the model is:

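\[ \Lambda = \begin{bmatrix} A_{11} & 0 & 1 & 0 & 0 & 0 \\ A_{21} & 0 & 0 & 1 & 0 & 0 \\ 0 & A_{12} & 0 & 0 & 1 & 0 \\ 0 & A_{22} & 0 & 0 & 0 & 1 \end{bmatrix}, \tag{16f} \]

where $A_{11} = A_{12}$ by assumption, and

\[ \Phi = \begin{bmatrix} 1 & & & & & \\ C_{T_1T_2} & V_{T_2} & & \text{Symmetric} & & \\ 0 & 0 & V_{e_{11}} & & & \\ 0 & 0 & 0 & V_{e_{21}} & & \\ 0 & 0 & C_{e_{11}e_{12}} & 0 & V_{e_{12}} & \\ 0 & 0 & 0 & 0 & 0 & V_{e_{22}} \end{bmatrix}. \tag{16g} \]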


This model is just identified (10 distinct elements in $\Sigma$ less 10 parameters
to be estimated). Let us return to Cronbach's example
in which there are Forms A ($y_{11}$) and B ($y_{21}$) initially and retests
on Forms A ($y_{12}$) and B ($y_{22}$) three years later. Cronbach partitions
the variance using the immediate and retest correlations among
forms (assumed parallel) which in our model corresponds to the
elements of $\Sigma$. We may translate Cronbach's partitioning procedure
into functions of the model parameters in equations (16a), (16b),
(16c), (16d), and (16e) as follows:

1. "Lasting General Variance” = (yi, уз) = Ан*р(Ть Ts) Aas" 
which according to the model equals p(ya» yı) = А" Та 
ТА а“. 

2. “Temporary General Variance” = „(у уз) — p(y, ym) = 
Au*Án* — Алур(Т,, Т,)Ам“ which according to the m 
equals р(Ул», ya) — р(уг2, Улз) = Ais Ass* — Ар (Та, ТА" 
In principle there is a different “Temporary General Variance” fot 


irs 


J 


P 
Aa (Ts, Тз)А 1з*. 


Plia Улз) — р(уэз› 


“Lasting Specific Variance” for Form 
the end time p(yis Уз) = р(улз, Уз) = AAs — Ар (Та
ТА“ which equals p(yis Уз) — р(уз» Уз) = Ав*Ав* — 


А р(уль Уз) — pn ya) = 


уз) = VI = (An*) plesn ёз) V 1 — (Ais*) 


and for Form В p(yas yas) — Pio Jas) 


V1 — (Ав*) plens, ez) V 1 — (Ал) 


4. “Temporary Specific Variance” [1 — 


ply» уз] = == pia 
To (А pleis $i) V 1 — (Ais 


As Aza” [5‏ ج 


correlations used by Cronbach, but 


other temporary specific v 


= р(022 Уз) — puso, Yas) = 


рй» уз] — lois yi) = 


уз] — [00:2 yi) — Der уз] = 


ж)? for the 
in principle there are three 


ariances 1 m Ara* Aza” E Ties (А»*) 


У (Aat Че AA — м1 = (Au) 1 


*p(es2, ёз) V 1 = 


Plein, бо) V 1 — (Ais*) > and 1 — 


plena, ёз) V 1 — (dat). 
An*An* = vi- (Азз*) 1

It can be seen that Cronbach's procedure for partitioning of variance
involves complicated functions of the model parameters. Not
only is it simpler to analyze observed correlations in terms of a set
of structural parameters, but it allows for analysis of overidentified
models. Further light can be shed on the assumptions implicit in
the model of (16a), (16b), (16c), (16d), and (16e) by asking what
variables account for the correlated errors. Assuming that a single
factor ($M_1$) underlies the correlation for Form A and another factor
($M_2$) for Form B the model becomes:

\[ y = \mu + \Lambda T + e, \tag{17a} \]

where

\[ y' = (y_{11}, y_{21}, y_{12}, y_{22}), \tag{17b} \]

\[ T' = (T_1, T_2, M_1, M_2), \tag{17c} \]

\[ e' = (e_{11}, e_{21}, e_{12}, e_{22}), \tag{17d} \]

\[ \Lambda = \begin{bmatrix} 1 & 0 & B_{11} & 0 \\ 1 & 0 & 0 & B_{21} \\ 0 & 1 & B_{12} & 0 \\ 0 & 1 & 0 & B_{22} \end{bmatrix}, \tag{17e} \]

and

\[ \Phi = \begin{bmatrix} V_{T_1} & & & \\ C_{T_1T_2} & V_{T_2} & & \\ 0 & 0 & 1 & \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]

Analysis of the identification problem shows that $B_{11}$, $B_{21}$, $B_{12}$,
and $B_{22}$ are not separately identifiable; only the products ($B_{11}B_{12}$)
and ($B_{21}B_{22}$) are identified. This means that in Jöreskog's program


the estimation for other parameters. Assuming $B_{11} = B_{12}$ and
$B_{21} = B_{22}$, this model is a simple transformation of (16a), (16b),
(16c), (16d), and (16e) under essentially tau-equivalent assumptions,
that is, $V_{e_{11}} \neq V_{e_{21}}$, $V_{e_{12}} \neq V_{e_{22}}$ in equation (16e). In particular it
can be seen that it must be assumed that $M_1$ and $M_2$ are uncorrelated.
In the usual case where the parallel tests are very similar
methods, this would appear to be a very dubious assumption. It is
possible to deal with oblique true and method factors but usually
more different method measures are required as in our 3 trait × 3
method example in Section I.
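Returning to the congeneric case noted earlier in this section, in which the model as a whole is underidentified but the unattenuated trait correlation is just identified, the bracketed ratio of covariances can be checked with a small Python sketch. The loadings and factor covariances below are hypothetical, as is the function name; the point is only that the loadings cancel from the ratio, so the squared trait correlation is recovered whatever the units of the four measures.

```python
def squared_trait_correlation(c_11_22, c_21_12, c_11_21, c_12_22):
    """Ratio of cross-form, cross-time covariances to within-time covariances."""
    return (c_11_22 * c_21_12) / (c_11_21 * c_12_22)


# Hypothetical congeneric loadings (unequal units) and factor covariances.
LOADINGS = {"y11": 0.8, "y21": 1.1, "y12": 0.9, "y22": 1.3}
V_T1, V_T2, C_T1T2 = 1.0, 1.5, 1.2

# Covariances between different forms, whose errors are independent by assumption.
C_11_22 = LOADINGS["y11"] * LOADINGS["y22"] * C_T1T2
C_21_12 = LOADINGS["y21"] * LOADINGS["y12"] * C_T1T2
C_11_21 = LOADINGS["y11"] * LOADINGS["y21"] * V_T1
C_12_22 = LOADINGS["y12"] * LOADINGS["y22"] * V_T2

RHO_SQUARED = squared_trait_correlation(C_11_22, C_21_12, C_11_21, C_12_22)
TRUE_RHO_SQUARED = C_T1T2 ** 2 / (V_T1 * V_T2)
```

Only covariances between different forms enter the ratio, which is why correlated errors for a repeated form do not disturb it.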

When methods of measuring a trait are made as different as
possible, it is usually the case that the units of measurement are
different, which means that congeneric rather than essentially tau-equivalent
or parallel assumptions are appropriate. Werts and
Linn (1970a) consider growth models based on congeneric measures,
e.g., in one case they use three congeneric measures of $T_1$ and
two congeneric measures of $T_2$, allowing for same test correlated
errors over time. This model is overidentified, but no attempt was
made to deal with this complication. Phrasing this problem in terms
of Jöreskog's general model:


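\[ y = \mu + \Lambda T + e, \tag{18a} \]

\[ y' = (y_{11}, y_{21}, y_{31}, y_{12}, y_{22}), \tag{18b} \]

where $y_{11}$ and $y_{12}$ are linked as are $y_{21}$ and $y_{22}$,

\[ T' = (T_1, T_2, M_1, M_2), \tag{18c} \]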
Ay 0 By 0
An 0 0 В. 

A =| Ags 0 блуд (18d) 
0 As Bs 0 
0 An 0 Ba 
1 

өрө Yn | no 
0 ОА 


0 0 Qo 


Assuming that $A_{11} = A_{12}$, $A_{21} = A_{22}$, and for convenience that
$B_{11} = B_{12}$, $B_{21} = B_{22}$, this model has four overidentifying restrictions
(15 distinct elements in $\Sigma$ less 11 parameters to be estimated).
Werts and Linn give two formulas (1970a, p. 198, equations [28]
and [29]) for estimating the correlation of status with gain involving
observed correlations and variances whereas Jöreskog's approach
generates a single estimate by equation (15a). In essence Werts
and Linn dealt with the elements of the observed variance-covariance
matrix S which may yield inconsistent estimates of $\rho_{T_1\Delta}$ whereas
such inconsistency cannot arise with respect to the elements in $\hat\Sigma$.
Jöreskog has an unpublished operating program for estimating factor
scores within the confirmatory factor analysis model (Jöreskog, 1971).
As Cronbach and Furby (1970) note, however, there is seldom
a need for such estimates.

Relationship to Factor Analysis 

A common practice in the factor analysis of growth data is to
compare standardized factor loadings at one time to the loadings
for the same set of measures at a later time. If the pattern of loadings
remains constant over time the inference is drawn that the factors
are measuring essentially the same dimension at different times.
For example we might have three measures of $T_1$ at time 1 with
factor loadings $A_{11}^{*} = .30$, $A_{21}^{*} = .40$, and $A_{31}^{*} = .50$ and identical
loadings on $T_2$ when these measures are repeated at time 2, i.e.,
$A_{12}^{*} = .30$, $A_{22}^{*} = .40$, and $A_{32}^{*} = .50$. For heuristic purposes let
us suppose that the repetition of tests did not result in methods
factors and that the true variance increased from $V_{T_1} = 1.0$ to
$V_{T_2} = 1.5$ over time and $C_{T_1T_2} = 1.2$. It may be immediately inferred
that the error variances for all tests increased over time since the
test reliabilities (in this model the squared factor loadings) remained
constant and the true variance increased. However, Wiley and
Wiley (1970) have persuasively argued that it is more likely that
error variances are a test characteristic which is likely to remain
constant over time. If this is so, then an increase in true variance
along the same dimension will necessarily mean that the reliability
of the tests will increase over time, i.e., the standardized factor
loadings will increase. In the same fashion it may be deduced that
if for any given test over time the unstandardized regression weights
($A_{11} = A_{12}$) and the error variances ($V_{e_{11}} = V_{e_{12}}$) are equal, then
in general the standardized factor loadings ($A_{ij}^{*}$) are not proportional
from one time to another. We conclude that comparison of standardized
factor loading patterns over time provides no logical basis
for any conclusions about whether pretests and posttests are measuring
the same variable. It appears to us that such an assumption,
which in this model is equivalent to equality of unstandardized
regression weights over time (e.g., $A_{11} = A_{12}$), is basically
untestable within the framework of this model. It would seem b


error variance are relatively constant (over time) test characteristics
but to build models and gather requisite information such that
these model parameters are identified.
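The argument about standardized loadings can be made concrete with a small Python sketch. It assumes, purely for illustration, an unstandardized weight of 1.0 for every test and error variances chosen so that the time 1 standardized loadings equal .30, .40, and .50; the true variance then rises from 1.0 to 1.5 while the weights and error variances are held constant. The function name and these settings are hypothetical.

```python
import math


def standardized_loading(weight, true_variance, error_variance):
    """Correlation of an observed score with its true score."""
    observed_variance = weight ** 2 * true_variance + error_variance
    return weight * math.sqrt(true_variance / observed_variance)


WEIGHT = 1.0                      # hypothetical common unstandardized weight
V_T1, V_T2 = 1.0, 1.5             # true variances at time 1 and time 2
TIME1_TARGETS = [0.30, 0.40, 0.50]

# Error variances that reproduce the time 1 standardized loadings.
ERROR_VARIANCES = [WEIGHT ** 2 * V_T1 * (1.0 / t ** 2 - 1.0) for t in TIME1_TARGETS]

TIME1 = [standardized_loading(WEIGHT, V_T1, e) for e in ERROR_VARIANCES]
TIME2 = [standardized_loading(WEIGHT, V_T2, e) for e in ERROR_VARIANCES]
RATIOS = [later / earlier for earlier, later in zip(TIME1, TIME2)]
```

Under these assumptions every standardized loading increases at time 2, and the ratios of time 2 to time 1 loadings differ from test to test, so the loading pattern is neither constant nor proportional even though each test measures the same dimension in the same units on both occasions.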
While it is not possible to test the assumption that $A_{11} = A_{12}$
it is quite possible for this assumption to be incompatible with the
assumption that $A_{21} = A_{22}$. The ratio of $V_{T_1}$ to $V_{T_2}$ resulting from
$A_{11} = A_{12}$ may differ from the ratio resulting from $A_{21} = A_{22}$. This
may be tested by the increase in $\chi^2$ (df = 1) resulting from the
addition of $A_{21} = A_{22}$ to the model in which $A_{11} = A_{12}$. Within the
framework of this model, if it is true that the corresponding pairs
of tests over time in fact have the same units, then the scaling of
$V_{T_1}$ to $V_{T_2}$ should be the same for each pair.
The finding that the data are consistent with the hypothesis that
$A_{11} = A_{12}$ and $A_{21} = A_{22}$ does not necessarily imply that the units of
measurement for the corresponding pairs of tests over time are the
same since it is quite possible for the scaling to be erroneous in
both pairs of tests but in the same way. If the data are inconsistent
with the hypothesis that $A_{11} = A_{12}$ and $A_{21} = A_{22}$ we could conclude
that the units over time are not the same for both sets of tests, but
it is still possible that the units are the same for one of the sets over
time. Even if it could be shown that $A_{11} = A_{12}$, this would only be
evidence consistent with, not proof of, the hypothesis that the scales
are measuring the same process over time.


Determinants of Growth 


Werts and Linn (1970b) have considered the problem of making
inferences about the determinants in a linear model. The Werts-Linn
formulation was based on classical true score assumptions,
i.e., no provision was made for methods factors. For heuristic purposes
let us reconsider the problem of growth determinants, formulating
the three trait, three method model in terms of growth ($T_3
= T_2 + \Delta$):


\[ T' = (T_1, T_2, \Delta, M_1, M_2, M_3), \tag{19a} \]
Ао Bi30 0 
А РО ОНО Ba 0 
Аз 0 УПОЛА. 0 Ва 
0 Аз 0 Bu оо 
peln "A2 07 Г: 764 (19b) 
ОАО 700 0 Bs 
0 Aw Аз Ва оо 
0 An An 0 В» 0 
0 Аз Аз 0 0 Ва 
1 
Сињ 1 
Cra Cra Va . (190) 


Сам, Стам. Сам, 1 
Cras Cru, Cams Сим, 1 
Сума Сума Сам, С.м: Сы,м, 1 


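It should be noted that this formulation does not directly
involve the parameters of the underlying growth model $\Delta =
B_{\Delta T_1 \cdot T_2}T_1 + B_{\Delta T_2 \cdot T_1}T_2 + \varepsilon$; however, the regression weights are:

\[ B_{\Delta T_1 \cdot T_2} = \frac{C_{T_1\Delta} - C_{T_1T_2}C_{T_2\Delta}}{1 - C_{T_1T_2}^{2}}, \tag{19d} \]

and

\[ B_{\Delta T_2 \cdot T_1} = \frac{C_{T_2\Delta} - C_{T_1T_2}C_{T_1\Delta}}{1 - C_{T_1T_2}^{2}}. \tag{19e} \]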

Traditional test theorists (e.g., Bloom, 1964; Thorndike, 1966)
have been very concerned with and have drawn substantive inferences
about the determinants of growth from the correlation of
status with gain, usually corrected for "attenuation." However,
as detailed by Werts and Linn (1970b), in a linear structural model
prime interest is in the model parameters $B_{\Delta T_1 \cdot T_2}$ and $B_{\Delta T_2 \cdot T_1}$ since
if either one is zero the inference will be drawn that the corresponding
variable does not directly influence gain. Except in the case in which
initial status is uncorrelated with all determinants of growth,
knowledge of the correlation of status with gain, $\rho_{T_2\Delta}$, does not
allow us to draw inferences about model parameters. It is quite
possible for $\rho_{T_2\Delta}$ to be completely spurious due to a common antecedent
influence or it is quite possible for $\rho_{T_2\Delta}$ to be zero without
implying that $B_{\Delta T_1 \cdot T_2}$ or $B_{\Delta T_2 \cdot T_1}$ be zero. For this reason we question
Thorndike's (1966, p. 124) interpretation: "In considerable part,
the factors that produce gains during a specified time span appear
to be different from those that produced the level of competence
exhibited at the beginning of the period." Our objection is that
Thorndike's conclusion was made from the correlation of status
with gain, without specifically introducing into the analysis any
presumed determinants of growth. In a linear structural model the
total association of initial status with growth is an insufficient basis
for drawing inferences about the various possible determinants of
growth.
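The point that a zero correlation of status with gain need not imply zero regression weights can be illustrated numerically. The Python sketch below assumes that status, a second determinant, and gain are all standardized, so that covariances are correlations; the function name and the three correlation values are hypothetical and were chosen only so that the zero-order correlation of status with gain is exactly zero.

```python
def gain_regression_weights(c_status_other, c_status_gain, c_other_gain):
    """Partial regression weights of gain on status and on a second determinant.

    Assumes unit variances for status and the second determinant.
    """
    denominator = 1.0 - c_status_other ** 2
    b_status = (c_status_gain - c_status_other * c_other_gain) / denominator
    b_other = (c_other_gain - c_status_other * c_status_gain) / denominator
    return b_status, b_other


# Hypothetical correlations: status and gain are uncorrelated at the zero order.
C_STATUS_OTHER = 0.5
C_STATUS_GAIN = 0.0
C_OTHER_GAIN = 0.4

B_STATUS, B_OTHER = gain_regression_weights(C_STATUS_OTHER, C_STATUS_GAIN, C_OTHER_GAIN)
```

Here the weight for status is negative and clearly nonzero although its correlation with gain is zero, because status is correlated with the second determinant. The total association alone therefore says little about the structural weights.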


Discussion 
The variety of test response tendencies covered by the rubric
"methods factors" appears to be an almost universal complication
in sociopsychological growth studies. Even though in principle the
multitrait-multimethod model presented in this paper provides for
"methods factors," it does not follow that this model does in fact
provide a better simulation of reality than previous models which
have typically ignored methods factors by assuming independent
errors of measurement. It may be expected that our procedure will
typically yield different parameter estimates (e.g., correlation of
status with gain) than previous procedures, but what has been
learned about growth and its determinants thereby? What is learned 
about reality from the overwhelming concern of the factor analyst 
with statistical fit? There is no guarantee that the best fitting
model yields substantively meaningful results (e.g., Werts, Jöreskog,
and Linn, 1971). Why bother with complicated structural
models involving unmeasured variables when it is likely that a 
simple regression equation involving only measured variables will 
provide the best prediction of the criterion? From our perspective, 
if the researcher’s basic interest is in reality, then the research must 
be designed to explore reality, i.e. to offer evidence as to which of 
the initially plausible alternative hypotheses (models) provides the 
better simulation. In some cases this may involve a study of the 
theoretical implications to see what information is necessary to 
discriminate between the alternative models. In other cases the 
study may be a continuing one as in the building of models to sim- 
ulate the national economy, in which case the ability to better pre- 
dict new yearly data is used to discriminate among models. Our 
purpose in making these remarks is to heighten the awareness of 
researchers that parameter estimates, such as the reliability of 
gain scores, are always made within the framework of a whole set 
of untested assumptions about the nature of reality. It is mislead- 
ing to talk about “the correlation of status with gain” since the 
meaning of this parameter is totally a function of the particular 
model used to derive the parameter. In most cases in which this 
type of estimate has been used, no effort has been made to examine 
the validity or even plausibility of the models underlying these 
estimates. The linear structural model presented herein is as sus- 
pect as any other model and needs to be justified as one of the plau- 
sible alternative hypotheses, prior to data analysis. 
DOMINANCE IN MENTAL IMAGERY¹


MORRIS P. LEIBOVITZ² PERRY LONDON
LESLIE M. COOPER
JOSEPH T. HART
Though Galton and Charcot began the scientific study of
imagery in 1880, little is yet known about the ability to form
mental images. Galton observed great individual differences in
clarity of visual images, and Charcot suggested that there are
distinct imaging types of people, some of whom produce predominantly
visual imagery, others mostly auditory, and still
others tactile, olfactory-gustatory, or kinesthetic imagery.
Little research has been done in this field since 1909 (Holt,
1964), when Betts presented the view still current that the individual
who was good in one form of imagery tended to be
good in other forms too, suggesting that individual differences
are unrelated to types of imaging. The present study examined
this hypothesis and studied the interrelation of imagery modalities
as measured by different assessment devices.
The initial and still classical study in this field was made by 
Sir Francis Galton (1883) of individual differences in the clear- 
ness of visual imagery. Galton’s questionnaire has furnished the 


1 This investigation was supported by a Public Health Service Research
Scientist Development Award Number K3-Mb-31,209 from the National Institute
of Mental Health, and by NIMH Grant # MH-12858, Cognitive Control
of Learning and Performance, Perry London principal investigator.

2 This report is largely condensed from M. P. Leibovitz, "Individual differences
in imagery among sensory modalities," unpublished doctoral dissertation,
University of Southern California, 1968.
basis for many later investigations. He became interested in visual
imagery as part of his efforts to find "the essential differences
between the mental operations of different men." He submitted a
list of printed questions to his friends in the scientific community
and was amazed that the great majority of scientists' replies,
contra his hypothesis, indicated that they were completely lacking
in visual imagery. The opposite situation, on the other hand,
prevailed among people Galton questioned in "general society."
Most of them declared that they habitually had distinct and
colorful visual images, and described them in great detail.

Many textbooks in psychology, including McKellar's on imagery
(1957), mistakenly attribute to Galton the classification of
people into "visile," "audile," and "motile" imagery types, but
it was Charcot who independently developed the notion of distinct
types. Even Boring states: "His (Galton's) questionnaire
for determining types . . . is known to every psychologist (1957)."
In Galton's original work, however, no such typology was ever
alluded to.

The most important study of imagery to date, in relation to the
question of types, is that of Betts, whose 1909 monograph,
"The Distribution and Functions of Mental Imagery," is still
the basis for accepted notions on the subject. Vinacke (1952)
reports: "It was investigated by Betts, who convincingly showed
that people cannot be classified according to their dominant
imagery. Rather those persons who have imagery at all tend to
display comparable degrees of all kinds." Woodworth and Schlosberg
(1964) state: "But a very careful study by Betts (1909),
with an expanded form of Galton's questionnaire, got pretty convincing
evidence that the individual who was good in one form
of imagery tended to be good in other forms too." Even so, our
analysis of Betts's work makes it appear that some of his conclusions
may incorrectly interpret his own findings, which need
to be reexamined.

Betts devised a test totalling 150 questions, 40 for visual
imagery, 20 each for auditory, cutaneous, kinesthetic, gustatory, and
olfactory imagery, and 10 for "organic" imagery. He listed seven
grades of clearness and vividness, from "perfectly clear" to "no
object present." These were printed on a separate key which
subjects could refer to as a standard for rating their imagery.
His 143 subjects comprised four groups, three of psychology
students and one of 18 trained psychologists.

Betts found a tendency on the part of relatively untrained 
observers to overrate the clearness and vividness of their imagery, 
but this tendency decreased as they gained experience with the 
tests. He also found that the psychologists were more than seven 
times as deficient in imagery as the other three groups, which 
he explained by Galton’s hypothesis that persons who deal chiefly 
with abstract forms of thinking finally come to find their power 
of imaging greatly diminished.

His most important conclusions, however, concerned imagery 
dominance: 

"Ability in voluntary imagery is distributed much more 
evenly among the different types of images than has been 
commonly thought. Measured by the proportion of images 
classified under the highest three degrees of clearness and 
vividness, all the seven types of images studied are included 
between 68% and 50%, the average variation being only 5%.” 

The figures submitted by Betts actually have little bearing on
dominance, however, because he assigned the items to particular
modalities a priori, and never asked the subjects for their
assignment. Even so, when we examine the wide variations shown
in each of the four groups, the data do not support his conclusion
about the evenness of the distribution of the different modalities
of imagery. The college students in the first three groups
yield almost opposite results from the psychologists in the fourth
group, which makes it questionable how typical either is of most
people. Finally, there are substantial differences in imagery even
among the students in the first three groups. All this makes
Betts's conclusions doubtful.

Despite acceptance of Betts's views, there have been some
empirical studies showing differential dominance of one or another
imagery modality. Griffitts (1927) devised seven imagery tests,
three of vividness, one of dominance, and three of verbal imagery.
Seventy-five stimulus words and 50 stimulus sentences were used
to elicit more than one modality of imagery. After reporting
two modalities for a stimulus, the subject was asked to rate
relative dominance on a 7-point scale. Visual imagery was ranked
first by 92 per cent of the 112 subjects tested, but an examination
of the stimulus materials shows that most stimuli were best suited
to visual imagery rather than to other modalities. Thus Griffitts'
rankings of dominance are open to question.

Brower (1947) studied the images aroused by reference to
onions frying in a pan and obtained the following descending rank
order of imagery modalities: visual, auditory, tactile, kinesthetic,
olfactory-gustatory.

Schlaegel (1953), in a modification of Griffitts’ test of concrete
imagery, asked the subjects to respond with their initial mental
image. The test was given to a congenitally blind group, a
partially blind group, and a sighted control group. The rank order
of responses by the 78 high school subjects in the sighted control
group was: Visual, 42%; Auditory, 31%; Kinesthetic, 10%;
Tactile-temperature, 10%; Olfactory-gustatory, 7%; Unknown, 4%. The
partially blind (less than 20/200 in both eyes) showed a higher
percentage of imagery in the visual modality (59%) than the
sighted control group (42%), but the rank order of the various
modalities was the same.

Finally, Roe (1951) reported data from a study of the personalities
of 64 eminent research scientists. Her results indicate that
biologists and experimental physicists, according to their own
reports, are high in visual imagery, while theoretical physicists,
psychologists, and anthropologists are disposed toward verbal
rather than imagery processes.

It is quite evident from the literature that the question of
dominance among imagery modalities still remains unanswered.
The main objective of this study was to provide a clear-cut
answer to this problem. Doing so necessitated the construction of
a suitable new instrument to assess imagery, since the ambiguity
in the literature is partly a function of deficiencies of earlier
instruments. Based on the observations of Galton, Griffitts, and
Roe, and on some personal observations, we hypothesized that
there are dominant and subordinate imagery patterns among
different people, such that:

1. People differ in their frequency of imaging in different
sensory modalities.

2. In a valid test of imagery dominance, the most common
descending rank order of imagery responses for different
sensory modalities is: visual, auditory, kinesthetic, tactile, and
olfactory-gustatory.


Method


The general strategy of the study required the construction 
of an imagery test that would tap the frequency of imagery in 
different sense modalities with a selection of items such that 
each modality could be equally represented. Data from this in- 
strument, as well as from some peripheral measures of imagery, 
were then analyzed by correlational techniques, including factor
analysis, and by inferential techniques, especially analysis of
variance.


Instruments 


1. Individual Imagery Test. Since none of the existing imagery
tests was considered adequate for our purposes, a new test was
devised. It consisted of 40 words selected so that each would
elicit imagery primarily in only one of the five modalities: see,
hear, touch, taste-smell, and movement. It was presumed that
the dominant imagery response of most people for each of the
eight stimuli selected for a given modality would correspond to
the authors’ own images, but this assumption was not tested prior
to using the test. This form of the instrument was therefore called
the “a priori” or “rational” imagery scale. The anticipated primary
categories for each of the 40 words were:

   Vision (See): fire, fog, lightning, ocean, rainbow, sunset,
   the American flag, and rose garden.
   Audition (Hear): applause, a shriek, God Bless America,
   ringing, snoring, typing, coughing, and onions frying.
   Touch: clay, fur, itching, sand, showering, slime, toothache,
   and wet.
   Taste-Smell: bacon and eggs, cigar smoking, fresh paint,
   gasoline, hot chocolate, mothballs, new-mown hay, and perfume.
   Kinesthesia (Movement): breathing, sneezing, dizziness,
   floating, nausea, pole-vaulting, skating, and walking.
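The composition of the a priori scale can be written down as a small lookup table. The sketch below is only an illustration: the dictionary layout and the helper name `check_scale` are assumptions, and it treats "itching" and "sand" as two separate touch items, which is what the stated count of eight items per modality requires.

```python
# A priori ("rational") scale: stimulus words grouped by the modality the
# authors expected each word to arouse, transcribed from the list above.
RATIONAL_SCALE = {
    "see": ["fire", "fog", "lightning", "ocean", "rainbow", "sunset",
            "the American flag", "rose garden"],
    "hear": ["applause", "a shriek", "God Bless America", "ringing",
             "snoring", "typing", "coughing", "onions frying"],
    "touch": ["clay", "fur", "itching", "sand", "showering", "slime",
              "toothache", "wet"],
    "taste-smell": ["bacon and eggs", "cigar smoking", "fresh paint",
                    "gasoline", "hot chocolate", "mothballs",
                    "new-mown hay", "perfume"],
    "movement": ["breathing", "sneezing", "dizziness", "floating",
                 "nausea", "pole-vaulting", "skating", "walking"],
}


def check_scale(scale, items_per_modality=8):
    """Check that every modality has the same number of unique items.

    Returns the total number of items in the scale.
    """
    all_items = [word for words in scale.values() for word in words]
    for modality, words in scale.items():
        if len(words) != items_per_modality:
            raise ValueError(f"{modality} has {len(words)} items")
    if len(set(all_items)) != len(all_items):
        raise ValueError("an item appears under more than one modality")
    return len(all_items)


TOTAL_ITEMS = check_scale(RATIONAL_SCALE)
```

With the lists as transcribed, each of the five modalities holds eight words and the scale totals 40 items.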

The test was arranged for individual administration using
a Q-sort method. Each word was printed on three colored 3" x 5"
cards: a blue card to designate the primary modality assigned
the item, a yellow card to indicate the secondary modality, and
an orange card for any third choice modality that the item
might arouse. There were six 5" x 8" cards printed SEE, HEAR,
TOUCH, TASTE-SMELL, MOVEMENT, and DISCARD, and
all the individual item cards had to be assigned among these
six categories.

3 The study was part of a larger project in which several experimental
procedures and statistical analyses were pooled or dove-tailed. This report,
therefore, is limited to those aspects of procedure and analysis relevant to
imagery dominance alone.

2. Group Imagery Test. The group imagery test was the London- 
Robinson test for retention of images (London and Robinson, 
1968). Originally used on children, this test consists of 20 line 
drawings, 10 of which are representations of familiar objects 
(Things) and 10 of which are unfamiliar figures (Blobs). The 
drawings are presented individually by means of a slide pro- 
jector to groups of Ss, and memory for the designs is tested 
by having Ss select the correct designs from two charts, each 
with 50 line drawings of “Things” and “Blobs” respectively. 


Subjects 


One hundred twenty-six female Ss who ranged in age from 16 
to 61 initially participated in this study. They were all volunteers 
solicited by advertisement in the local newspaper, from recom- 
mendations of student experimenters, and from feature stories 
in the newspaper. 


Procedure 


The experiment was conducted in two sessions, one week apart. 
The first was a group session, in which the London-Robinson 
Imagery Test was administered, and the second was an individual 
session, in which the subject received the Individual Imagery
Test. Since the instructions for the latter were quite important,
they will be described in some detail. 

Each subject was first interviewed by the experimenter to 
obtain some basic information on age, education, employment, 
and how she came to participate in the experiment. Then the 
experimenter asked her to respond to the following instructions 
and questions: 

1. Close your eyes and tell me if you can see any member of
your family. Describe their features. Can you get a visual image
of one of the rooms in your home? Describe it in detail. Can
you get a visual image of a fruit bowl? Describe the colors and
tell me what, if any, fruit you see.

2. Can you see the images with your eyes open? 

3. Do the visual images ever block out the real visual scene?

Then the subjects were read the following instructions:

The purpose of this experiment is to find out how people form 
different kinds of mental images. We want to see how different 
ideas can arouse different images in your mind. There are 
some kinds of things which people imagine mainly in a visual 
form, others they imagine most as sounds, still others as 
smells, tastes, touches, and movements, and of course, many 
different combinations of these different senses. For example, 
when you think of a mountain, the main image that comes 
to your mind is probably the sight of the mountain. Some 
people might visualize it more clearly than others, but it is 
still a visual image in any case. On the other hand, if you 
think of a car horn, you probably imagine chiefly the sound 
of honking, which is an auditory image rather than a visual 
one. In the same way, the thought of a pin prick probably 
brings to mind a tactile image, in which the main sense is 
the feeling of touch. Imagining yourself falling brings a 
feeling of movement or kinesthesis. Thinking intensely about 
ice cream makes you imagine taste and about ammonia makes 
you imagine smell. 

What we are going to do in this test is give you a group of 
words that we want you to imagine intensely so you can 
report on the kinds of imagery they arouse in you. It may 
be helpful after you read each word on these cards to close 
your eyes while imagining it. Then we want you to decide what 
are the main kinds of imagery each word calls forth and to 
report on them. 

Here is how you do it. There are five kinds of imagery 
shown on these five cards: (Lay out in this order, subjects’ 
left to right) SEE, HEAR, TOUCH, TASTE-SMELL, AND 
MOVEMENT. Notice we have combined taste and smell 
because, for most people, they always go together. Now here 
are forty different ideas, one on each of these cards. All 
you have to do, when you decide what your main imagery is 
for a word, is to place the blue card on top of the proper card. 
Of course, some words will arouse more than one kind of im- 
agery, so we have provided you with three cards for each 
word. The blue card is for your primary or main mental image, 
the yellow card for the next most prominent one, and the
orange card for the next. If you have only one kind of imagery,
put the blue card in that pile and put the other cards in the
discard pile. If you have only two kinds, use the blue card
for the first, the yellow card next, and discard the orange
card.

Always use the blue card first, the yellow next, and the orange
[...] imagery at all for some ideas. If this happens to you on some
of the words, discard all three cards.

Finally, it is important for you to understand that this is
not a “word association” test, for initial images. We are not
asking “for the first thing that comes to your mind,” but
for the main images that come to mind as you continue to
imagine the ideas on the cards. How well we can learn about
the process of imagination depends on how accurately you [...]

[...] on top, the yellow card next, and the orange card on the bottom.
The 40 stimuli were presented in alphabetical order; no
more than two items of the same modality occurred in succession.
The experimenter checked off the categories of blue, yellow,
or orange on a score sheet as the subject placed the item cards
on each modality card. The average time for this test was [...]


[...] test were selected a priori, without empirical testing. This
procedure allowed some error in the placement of items in terms of
the imagery that actually would be elicited most often in a [...]
population, i.e., we may not have guessed correctly in terms
of a statistical criterion. Consequently, it was decided to examine [...]

[...] a new set of subscales for each modality on this basis. These
subscales will be referred to as the empirical subscales.


[...] 40, since six items were not placed in a single imagery modality
by the majority of subjects and were therefore eliminated (ocean,
rose garden, typing, coughing, sneezing, and nausea). In addition,
two items were placed by the majority of subjects in a
different modality than that originally predicted, and they were
accordingly reclassified. “Pole-vaulting” was reclassified to the
“see” modality from the “movement” modality, and “onions
frying” was reclassified to the “taste-smell” modality from its
a priori “hear” category. The percentage of subjects’ responses
for each of the 40 items in the imagery test is shown in Table 1.
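The move from a priori to empirical subscales can be sketched as a rule applied to the first choice percentages. The example below uses four legible rows of Table 1. The 50 per cent cutoff is an assumption standing in for "the majority of subjects"; the authors' exact criterion is not stated, and the function name `classify` is hypothetical.

```python
MODALITIES = ("see", "hear", "touch", "taste-smell", "movement", "discard")

# First choice percentages for four items, copied from Table 1
# (order of values follows MODALITIES).
FIRST_CHOICE_PCT = {
    "lightning": (88, 7, 0, 0, 1, 4),
    "onions frying": (13, 10, 1, 74, 0, 2),
    "walking": (14, 3, 2, 0, 79, 2),
    "nausea": (5, 1, 20, 20, 42, 12),
}

# Modality each item was assigned to on the a priori scale.
A_PRIORI = {
    "lightning": "see",
    "onions frying": "hear",
    "walking": "movement",
    "nausea": "movement",
}


def classify(pcts, cutoff=50.0):
    """Return the modal modality if its share exceeds the cutoff.

    Returns None when no imagery modality reaches the cutoff, which
    corresponds to dropping the item from the empirical subscales.
    The cutoff value is an assumption, not the authors' stated rule.
    """
    best = max(range(len(pcts)), key=lambda i: pcts[i])
    if pcts[best] <= cutoff or MODALITIES[best] == "discard":
        return None
    return MODALITIES[best]


EMPIRICAL = {item: classify(p) for item, p in FIRST_CHOICE_PCT.items()}
RECLASSIFIED = sorted(
    item for item, m in EMPIRICAL.items()
    if m is not None and m != A_PRIORI[item]
)
ELIMINATED = sorted(item for item, m in EMPIRICAL.items() if m is None)
```

Under this reading, "onions frying" moves from "hear" to "taste-smell" and "nausea" is dropped, in line with the text; other rows of Table 1 may call for a stricter cutoff than the one assumed here.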


Results


Though imagery scores for each subject on each modality
were obtained by two different scoring procedures, yielding
rational and empirical subscales, there were so few differences
between the two that only the results from the empirical
subscales will be presented here.

The empirical subscale was scored in two ways. Unweighted
scores consisted of the sum of first choice responses in each
modality. Weighted scores were obtained by assigning a weight
of three for a first choice response, two for second choice
responses, and one for third choice responses, then normalizing
scores for each modality to include eight items (Table 2).
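The two scoring schemes can be illustrated with a short sketch. The weights of three, two, and one come from the text; the data layout, the function name `score_subject`, and the three-item example subject are hypothetical, and the unweighted score is assumed to count first choice responses only.

```python
# Weight given to a card by its choice rank (1 = blue, 2 = yellow,
# 3 = orange), as described in the text.
WEIGHTS = {1: 3, 2: 2, 3: 1}


def score_subject(placements):
    """Score one subject's card placements.

    placements: list of (modality, choice_rank) pairs, one pair per card
    placed on a modality pile; discarded cards are simply left out.
    Returns (unweighted, weighted) dictionaries keyed by modality.
    """
    unweighted, weighted = {}, {}
    for modality, rank in placements:
        weighted[modality] = weighted.get(modality, 0) + WEIGHTS[rank]
        if rank == 1:  # assumption: unweighted = first choices only
            unweighted[modality] = unweighted.get(modality, 0) + 1
    return unweighted, weighted


# Hypothetical subject responding to three items.
EXAMPLE = [
    ("see", 1), ("hear", 2),                       # item A, orange discarded
    ("see", 1), ("touch", 2), ("movement", 3),     # item B, all three used
    ("taste-smell", 1), ("see", 2),                # item C, orange discarded
]
UNWEIGHTED, WEIGHTED = score_subject(EXAMPLE)
```

For this subject the weighted "see" score is 3 + 3 + 2 = 8, while the unweighted "see" score is 2. The normalization to eight items per modality is a separate step applied to the group totals.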

The first hypothesis stated that people use one imagery
modality more than others. Table 3 presents the unweighted and
weighted means of all modalities for all 126 subjects. The hypothesis
was tested by a treatment-by-subject analysis of variance of the
unweighted empirical subscores, in which the treatment-by-subject
interaction was of greatest interest. The results were significant at
the .05 level, which clearly supports the hypothesis that people
differ in their facility in imaging with regard to different
modalities (Table 4).

The second hypothesis stated that the descending rank
order of dominant imagery modalities in the general population is:
visual, auditory, kinesthetic, tactile, olfactory-gustatory. This was
supported by the weighted scoring system. Of the total
weighted values of all responses, the percentages for each
modality are: visual 33%, auditory 19%, kinesthetic 18%, tactile
16%, and olfactory-gustatory 14%. The results of other methods of
assessing rank order of dominance also make vision the primary
imaging modality, but yield different rank orders of the other
modalities. These are discussed later.


TABLE 1
Percentage of First Choice Responses in Each Modality (N = 126)

Item                  See  Hear  Touch  Taste-smell  Movement  Discard
Applause               13    73      2            0        10        2
A shriek                2    90      1            0         6        1
Bacon & Eggs           17     4      0           76         0        3
Breathing               4    19      6            1        69        1
Cigar Smoking          12     0      2           86         0        0
Clay                   27     0     65            2         5        1
Coughing                4    56      5            0        34        1
Dizziness               7     0     18            0        71        4
Fire                   67     ?     10            ?         ?        1
Floating               12     0     18            0        66        4
Fog                    75     1     15            8         0        1
Fresh Paint            17     0      8           18         1        1
Fur                    20     1     77            1         1        0
God Bless America       7    69      6            0         4       14
Gasoline                8     0      1           90         0        1
Hot Chocolate          17     0      9           73         0        1
Itching                 7     1     62            ?        28        1
Lightning              88     7      0            0         1        4
Mothballs              38     0      1           79         0        2
Nausea                  5     1     20           20        42       12
New-mown Hay           21     1      1           72         0        5
Ocean                  52    25      3            1         4        2
Onions Frying          13    10      1           74         0        2
Perfume                 8     0      2           86         0        4
Pole-vaulting          63     0      2            ?        33        1
Rainbow                95     0      0            0         1        4
Ringing                 3    93      3            0         0        1
Rose garden            58     1      2           40         0        4
Sand                   33     1     64            2         0        0
Showering              11    17     59            3         8        2
Skating                27     ?      ?            0        60        4
Slime                  24     1     64            6         ?        1
Sneezing                6    26     17            2        48        ?
Snoring                 4    85      1            0         3        4
Sunset                 92     2      1            0         2        3
The American Flag      86     0      5            1         3        5
Toothache               7     0     72            2         6       13
Typing                 11     ?      ?            0        29        1
Walking                14     3      2            0        79        2
Wet                    17     0     73            3         ?        6

a The modal response for each item is underlined.
? Entry illegible in the source.


TABLE 2
Item Analysis of 34 Weighted Items Used in Empirical Subscale: Total of All
Scores Assigned to Each Item

Item                   See  Hear  Touch  Taste-Smell  Movement
Applause               161   322     64            2       107
A shriek                68   358     12            0        67
Bacon & eggs           216    68     15          324         2
Breathing               81   179     29           12       296
Cigar smoking          200     3     19          346        10
Clay                   211     5    310           70        42
Dizziness               81     8     83            4       292
Fire                   311   116    105          119        19
Floating               137     6    104            0       276
Fog                    325    17    142           87         ?
Fresh paint            198     0     98          316         8
Fur                    230     6    330           42        20
God Bless America       48   287     26            0        24
Gasoline               138    17     25          351         3
Hot chocolate          210     2     93          331         5
Itching                 61    22    282            3       186
Lightning              352    51     13            3        31
Mothballs              209     8     44          325         0
New-mown hay           212     6     75          304         4
Onions frying          188   130      5          320         2
Perfume                118     1     91          339         9
Pole-vaulting          298    20     24            5       179
Rainbow                362     1     20            6        13
Ringing                 83   355     24            0        27
Sand                   247    13    315           24        25
Showering               11   174    265           35        67
Skating                210    96     39            4         ?
Slime                  211     5    279           76        27
Snoring                 89   340     17            2         ?
Sunset                 357    11     25            2         ?
The American flag      345    31     68            6         ?
Toothache               59     6    283           23        32
Walking                159    91     35            5       326
Wet                    145    26    307           44        12

Total                 6431  2781   3666         3530      2498
No. items                7     5      8            9         5
Normalized            7349  4449   3666         3138      3996
%                      33%   19%    16%          14%       18%

Normalized scores are based on 8 items per modality.
? Entry illegible in the source.
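The normalization at the foot of Table 2 can be retraced from the column totals and item counts. The sketch below assumes that each total is rescaled by 8 divided by the number of items in that modality and that the percentages are shares of the normalized grand total; the variable and function names are illustrative only.

```python
# Column totals and item counts from the bottom rows of Table 2.
TOTALS = {"see": 6431, "hear": 2781, "touch": 3666,
          "taste-smell": 3530, "movement": 2498}
N_ITEMS = {"see": 7, "hear": 5, "touch": 8,
           "taste-smell": 9, "movement": 5}


def normalize(totals, n_items, target_items=8):
    """Rescale each modality total as if it were based on target_items."""
    return {m: totals[m] * target_items / n_items[m] for m in totals}


NORMALIZED = normalize(TOTALS, N_ITEMS)
GRAND_TOTAL = sum(NORMALIZED.values())

# Share of the normalized grand total held by each modality, in per cent.
PERCENT = {m: 100 * v / GRAND_TOTAL for m, v in NORMALIZED.items()}

# Modalities ordered from most to least dominant.
RANK_ORDER = sorted(PERCENT, key=PERCENT.get, reverse=True)
```

Under these assumptions the normalized values agree with the table to within rounding (for example, 6431 x 8/7 is about 7349.7), and the resulting order is see, hear, movement, touch, taste-smell. The hearing share computes to about 19.7 per cent, which the table reports as 19%.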

TABLE 3
Weighted and Unweighted Means of Imagery Subscales (Empirical Scales)
(N = 126)

Type of Mean      See  Movement  Touch   Hear  Taste-Smell
Weighted        13.29     11.52  11.86  13.26        13.10
Unweighted       2.66      2.30   2.37   2.65         2.62


Scores on a total of 46 variables, including the imagery scores,
were submitted to a correlational analysis and a subsequent
factor analysis. The latter was a principal component analysis,
rotated to a varimax criterion, with the number of factors
rotated being equal to the number of eigenvalues greater than
one, when the squared multiple correlation was used in the
diagonal of the correlation matrix. Of the 1035 possible
intercorrelations, 366 were significant at the .05 level,
indicating a high interrelationship between these variables
(Table 5).
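The figure of 1035 possible intercorrelations follows from the number of distinct pairs among 46 variables. The sketch below checks that count and the share of significant correlations; the chance expectation it computes is only a rough benchmark that assumes independent tests, which correlated variables do not satisfy.

```python
import math

N_VARIABLES = 46
N_SIGNIFICANT = 366  # correlations reported as significant at the .05 level

# Number of distinct variable pairs: n(n - 1) / 2.
N_PAIRS = math.comb(N_VARIABLES, 2)

# Proportion of all correlations that reached significance.
SHARE_SIGNIFICANT = N_SIGNIFICANT / N_PAIRS

# Rough benchmark: count expected at the .05 level if all tests were
# independent and every true correlation were zero (an assumption).
EXPECTED_BY_CHANCE = 0.05 * N_PAIRS
```

About 35 per cent of the correlations were significant, against roughly 52 expected under the simple chance benchmark.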

The rotated factor matrix is presented in Table 6. The rotation 
criterion resulted in 13 factors which accounted for virtually all 
of the variance. With respect to imagery, the factors are especially 
clear. Six factors (II, IV, V, VIII, IX, XIII) comprehend all
first choice imagery responses, and three factors (I, XI, XII)
account for all other imagery responses. 

The first factor accounted for approximately 26 per cent of the
variance. It had high positive loadings (greater than .30)
on virtually all second and third choice imagery scores for all
sense modalities, and high negative loadings on the second and
third choice discard scores. This pattern of loadings indicates
that this factor reflects secondary imagery, i.e., images which
are neither commonly nor predictably aroused.


TABLE 4
Analysis of Variance of Imagery Modality Means (Empirical Scale) by Subjects
(N = 126)

Source of Variation         df   Mean Square        F
Subjects                   125          2.48     3.00*
Treatments (Modalities)      4         15.63    18.90*
S x T                      500          1.14     1.38**
Within Cells (Error)      2520           .83
Total                     3149

* p < .001.
** p < .05.
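The F ratios in Table 4 can be retraced by dividing each mean square by the within-cells mean square. The sketch below assumes that the within-cells term served as the error term for all three effects; because the tabled mean squares are rounded to two decimals, the recomputed ratios differ slightly from the printed ones.

```python
# Mean squares and degrees of freedom as printed in Table 4.
MEAN_SQUARES = {
    "subjects": 2.48,
    "treatments": 15.63,
    "subjects x treatments": 1.14,
}
MS_ERROR = 0.83
DF = {
    "subjects": 125,
    "treatments": 4,
    "subjects x treatments": 500,
    "error": 2520,
}

# F = effect mean square / error mean square (assumed error term).
F_RATIOS = {name: ms / MS_ERROR for name, ms in MEAN_SQUARES.items()}

# The component degrees of freedom should add up to the tabled total.
DF_TOTAL = sum(DF.values())
```

The recomputed ratios are about 2.99, 18.83, and 1.37, and the interaction degrees of freedom equal the product of the subject and treatment degrees of freedom (125 x 4 = 500).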
Factor II is clearly a primary imagery factor and accounted
for approximately 13 per cent of the variance. Peculiarly
enough, it relates the visual and taste-smell modalities inversely,
showing high positive loadings with primary “see” choices and
high negative loadings with primary “taste-smell” choices. This
suggests that subjects who rely heavily on the visual mode 
for their primary images tend not to use the taste-smell modality 
and vice-versa. This inference is supported by the high negative 
correlation (—.63) between the first choice “see” scores and the 
first choice “taste-smell” scores. 

Factor IV is clearly a kinesthetic imagery factor and accounted 
for approximately nine per cent of the variance. Its high negative 
loadings (-.89) mean that some subjects consistently avoid the use
of kinesthetic imagery. This tendency is independent of any other 
primary imagery tendencies. 

Factor V is clearly a tactile factor with a high positive 
loading (.92) on first choice (primary) “touch” and accounted 
for approximately eight per cent of the variance. Second choice 
(secondary) “touch” loads high on Factor XI (.82). An interesting
relationship occurs between Factors V and XI. Second choice
“touch,” which loads .82 on Factor XI, loads -.12 on Factor V.
This inverse relationship is true for all other high loadings on
Factor XI.

Factor VIII, which accounts for five per cent of the variance, is
a nonspecific imagery factor, i.e., it reflects those items which
subjects discarded because they could not form a primary image 
in any specific modality. It loads .82 for the empirical Discard 
subscales and has high negative loadings on the imagery mo- 
dalities of the empirical subscales (See —.62, Hear —.75, Touch 
—.28, Taste-smell —.49 and Movement —.27). 

Factor IX accounts for approximately five per cent of the
variance. We have named it the Hearing imagery factor, but it is
the least clear of our first choice (primary) factors. On the one
hand, first choice “hearing” has a higher positive loading on
this factor (.47) than on any other. On the other hand, another
variable (second choice “movement”) has a higher loading on
this factor (.84), and another factor (XIII, -.58) is loaded more
by first choice “hearing.” The latter loading (-.58), however,
is probably better explained as a function of the relationship
of hearing imagery to increasing age (which loads .67 on Factor
XIII) than as a hearing factor. If anything, these two loadings
suggest that XIII is an age factor. The only other important
loading that seems unrelated to hearing on Factor IX is second
choice “movement.”


TABLE 5
Correlational Analysis of 46 Variables in Study

Variables:
 1. Sex
 2. Age
 3. Subjective: image of familiar faces
 4. Subjective: image of room
 5. Subjective: color in image
 6. Subjective: image with eyes open
 7. Subjective: image blocks out real scene
 8. EEG Alpha (base rate)
 9. EEG Alpha during imagery task (Drewes)
10. Group Imagery Test: things
11. Group Imagery Test: blobs
12. Group Imagery Test: total score
13. Hypnotic Susceptibility Test
14. Hypnotic Susceptibility Test: 4 key items
15. Rational Imagery Test: Total, see
16. Rational Imagery Test: Total, hear
17. Rational Imagery Test: Total, touch
18. Rational Imagery Test: Total, taste-smell
19. Rational Imagery Test: Total, movement
20. Rational Imagery Test: Total, discard
21. Rational Imagery Test: Total
22. Rational Im. Test: see, 1st choice
23. Rational Im. Test: see, 2nd choice
24. Rational Im. Test: see, 3rd choice
25. Rational Im. Test: hear, 1st choice
26. Rational Im. Test: hear, 2nd choice
27. Rational Im. Test: hear, 3rd choice
28. Rational Im. Test: touch, 1st choice
29. Rational Im. Test: touch, 2nd choice
30. Rational Im. Test: touch, 3rd choice
31. Rational Im. Test: taste-smell, 1st choice
32. Rational Im. Test: taste-smell, 2nd choice
33. Rational Im. Test: taste-smell, 3rd choice
34. Rational Im. Test: movement, 1st choice
35. Rational Im. Test: movement, 2nd choice
36. Rational Im. Test: movement, 3rd choice
37. Rational Im. Test: discard, 1st choice
38. Rational Im. Test: discard, 2nd choice
39. Rational Im. Test: discard, 3rd choice
40. Empirical Imagery Test: see
41. Empirical Imagery Test: hear
42. Empirical Imagery Test: touch
43. Empirical Imagery Test: taste-smell
44. Empirical Imagery Test: movement
45. Empirical Imagery Test: average
46. Empirical Im. Test: no response per item

[Correlation matrix not recoverable from the source.]

Factor XI accounts for approximately four per cent of the 
variance and, as previously noted, loads .82 on second choice 
“touch.” The inverse relationship of the loadings on Factors 
V and XI has previously been mentioned. It is significant that 
second choice “touch” is the only secondary image variable 
that does not load significantly on Factor I. For these reasons, 
Factor XI is considered to be a separate factor of second choice 
"touch." 

Factor XII, which accounts for three per cent of the variance,
is a secondary imaging factor of “taste-smell.” There is no
mary taste factor since it is confounded with and suppressed by 
visual imagery. In this connection, it is interesting to observe 
that vision biologically suppresses olfactory senses. Phylogenetically, 
the more highly developed the cortical sensory areas of the 
nervous system, the less well developed are the olfactory- 
gustatory systems. 

The remaining four factors (III, VI, VII, X) are clearly factors
representing a pure form of the four tests included in this
study: (1) subjects’ pre-test reports on imagery, Factor III;
(2) Group Imagery Test, Factor VII; (3) Susceptibility Test,
Factor VI; (4) EEG Alpha, Factor X. In each case, the factors
load highly with the respective test variables and with virtually
no other factor. It must be concluded that, even though the above
tests purport to relate to imagery, they do not measure the same
things as the individual imagery test.


Discussion 


The most important finding of the study clearly supports the
hypothesis that imaging is not a general trait: people use one
modality more than others. In examining their first choice images,
97 (77%) of the 126 subjects who showed their dominance for
a particular sensory modality appeared deficient in one or more of
the other sensory modalities (measured by four or fewer first
choice responses in a given modality out of the predicted
eight). There were 71 subjects (56%) who placed 12 or more
first choice responses in a given sensory modality (which is 30% or
more out of the possible 40 first choices), indicating their strong
choice (dominance) for a particular sensory modality. As a
further indication of the wide individual differences in imaging,
there were seven subjects who recorded over 50 per cent of their
first choice responses in the visual modality, and, at the other
end of the scale, there were five subjects who had no visual imagery
whatsoever. These results contradict previous research which
suggests that persons good in one form of imagery tend to be good
in other forms. It must be pointed out, however, that the research
of Betts (1909) and Sheehan (1966) involved a determination of the
“vividness” of imagery, which was not tested in this study. The
imagery test used here was designed to test frequency of choice of
imagery modalities and not to make comparisons of “vividness.” The
question, then, of whether some people are better imagers than
others cannot be answered by this study in terms of the “quality”
of their mental imagery, but only in terms of the “quantity” of
mental imagery they have. The measurement obtained is based
on total score (the sum of first, second, and third choices) less
discard score; the quantity of a subject’s imagery depends on
how many items can be imaged in any one of the five sensory
modalities minus those which cannot be imaged and are therefore
placed in the discard category.

We have determined that there is a hierarchy of imagery
modalities, but the precise ranking is open to question depending
on the scores used to determine rank. As previously noted, the
descending rank order of responses obtained from the total of all
scores for each of the sensory modalities (based on the 34 of the
40 items that met the criterion for significance) was visual 33%,
auditory 19%, kinesthetic 18%, tactile 16%, and olfactory-gustatory
14%. The visual modality is clearly dominant, while
the other four modalities tend to cluster without showing
significant differences among them. When we examine first choice
responses only, however, we obtain the following rank order:
visual 54%, olfactory-gustatory 24%, tactile 15%, kinesthetic
5%, and auditory 2%. This again confirms the dominance of the
visual imagery modality, but there is a reversal of ranks between
the auditory and olfactory-gustatory modalities. We obtain yet a
third hierarchy of imagery modalities when we examine the pattern
of factor loadings. The descending rank order based on the
percentage of the total variance accounted for (Table 6) is: visual
(with high negative loadings on olfactory-gustatory) 13%,
kinesthetic 9%, tactile 8%, auditory 5%. All three of these analyses
confirm conclusively that vision is the dominant imagery modality,
while the other four modalities vary in rank order depending
on the scoring basis used. It is interesting to note that, as we go
up the phylogenetic scale, reptiles, birds, and primates make
increasingly little use of smell and depend for both safety and
food on vision.
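The three hierarchies described above can be set side by side in a short sketch. The percentages are those quoted in the text; the dictionary layout and the helper `rank_order` are illustrative, and the factor-based ranking has no separate olfactory-gustatory entry because that modality loads negatively on the visual factor.

```python
# Percentages quoted in the text for the three ways of ranking modalities.
WEIGHTED_PCT = {"visual": 33, "auditory": 19, "kinesthetic": 18,
                "tactile": 16, "olfactory-gustatory": 14}
FIRST_CHOICE_PCT = {"visual": 54, "olfactory-gustatory": 24, "tactile": 15,
                    "kinesthetic": 5, "auditory": 2}
FACTOR_VARIANCE_PCT = {"visual": 13, "kinesthetic": 9, "tactile": 8,
                       "auditory": 5}


def rank_order(scores):
    """Return the modalities sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


ORDERS = {
    "weighted": rank_order(WEIGHTED_PCT),
    "first choice": rank_order(FIRST_CHOICE_PCT),
    "factor variance": rank_order(FACTOR_VARIANCE_PCT),
}

# The modality ranked first under each scoring basis.
TOP = {basis: order[0] for basis, order in ORDERS.items()}
```

Vision ranks first on every basis, while the auditory modality moves from second place under weighted scoring to last place under first choice scoring.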

The factor analysis, which isolated 13 factors accounting for 
virtually all of the variance, demonstrates clearly that the im- 
agery factors are not contaminated with the other tests. Two 
phenomena that we thought might be related to imagery, EEG 
alpha and hypnotic susceptibility, came out as separate factors 
without loading at all on any of the imagery factors. The factor 
analysis adds further weight to the separateness of the imagery 
modalities since the results shown in Table 6 reflect the six
factors which comprehend all first choice imagery responses.

There are some interesting implications that follow from our 
finding of individual differences in dominance of imagery modal- 
ities. The first question that seems to require an answer is why 
people come to rely on a given imagery modality as their dominant 
one. Is it a hereditary factor, as Galton has suggested, or does
it result largely from training or experience? If the latter situation
prevails, how early in one’s life and by what means is this 
dominance set and how changeable is it? 

The second question concerns the relationship of imagery 
dominance to aptitudes and to vocational training. Griffitts 
(1924) points out that visual imagery is essential for success
in most branches of engineering, architecture, designing,
illustrating, cartoon drawing, etc. Obviously, imagery dominance
in another modality could be an important factor to consider in
vocational counseling.

A third question concerns the implications which individual 
differences in imagery might have in the area of communication. 
Difficulties in interpersonal communication may sometimes be 
related to differences in imagery. We have often heard one
person in an argument state, “I am sorry you can’t see it my
way.” Wide individual variations in perception may partly depend
on imagery, and the nature of one’s own imagery may be such as to 
introduce into one’s thinking a subtle personal parochialism that 
can seriously impede adequate communication. 

A fourth question concerns imagery dominance in relation to 
cortical localization as assessed by evoked potential responses 
and EEG patterns. This question leads to a consideration of 
the physiological substratum for imagery differences, an area 
of research as yet unexplored. We now have research underway 
on the relationship of imagery dominance to brain wave patterns. 
STANDARD ERRORS OF ESTIMATE IN ITEM-EXAMINEE
SAMPLING AS A FUNCTION OF TEST RELIABILITY,
VARIATION IN ITEM DIFFICULTY INDICES AND
DEGREE OF SKEWNESS IN THE NORMATIVE
DISTRIBUTION


DAVID M. SHOEMAKER 


Southwest Regional Laboratory for Educational 
Research and Development 


ITEM-EXAMINEE sampling is a procedure in which a set of K
test items is subdivided into t subtests containing k items each,
with each subtest administered to n examinees selected from
the population of N examinees. Although each examinee receives
only a proportion of the complete set of items, the statistical
procedures given by Lord (1960) permit the researcher to
estimate the mean and variance of the test score distribution
which would have been obtained by testing all N examinees
over all K items. Although it is well-known that item-examinee
sampling is the preferred design for a variety of testing
problems, few procedural guidelines are available to aid the
researcher in determining the most appropriate number of
subtests, number of items per subtest, and number of examinees
per subtest. Previous investigations (Shoemaker, 1970a,
1970b, 1971; Lord and Novick, 1968) have outlined several
guidelines, notably:

1. The standard errors of estimate for μ̂ and σ̂ generally decrease
with increases in the number of observations (defined as the
product tkn).

2. In estimating μ or σ, for a given number of observations,
increasing the number of subtests t is preferable to increasing
k or n.

3. For a given test length K, sampling plans having the same
number of observations have generally the same standard
error of estimate for σ̂² but different standard errors of estimate
for μ̂. (It should be noted that these last two guidelines were
the result of an item-examinee sampling investigation where,
for all sampling plans considered, tk exceeded K and σ_p² was
approximately equal to zero.)

4. When tk = K, the most efficient sampling plan for estimating μ
is one in which t = K and k = 1.

Conspicuous by its absence in this series of investigations was a
systematic examination of the effect on standard errors of
estimate due to variations in test reliability α₂₀. The investigation
described herein was designed primarily to remedy this
situation. Parameters considered additionally were the variance
of item difficulty indices σ_p² and degree of skewness in the normative
distribution. The parameters estimated were the mean test score
μ and the standard deviation of test scores σ.


Calculating Standard Errors of Estimate

Standard Error Of Estimate For μ̂

Lord and Novick (1968, equation 11.12.3) have derived
algebraically an equation from which the standard error of
estimate for the mean proportion correct score under multiple
matrix sampling may be computed. Modifying this equation for
the mean test score results in

\[
\mathrm{VAR}(\hat{\mu}) = \frac{1}{tkn(K-1)(N-1)}
\Big[ K^{2} N \sigma_p^{2} \{(K-k)(n-1) - kn(t-1)\}
+ K \sigma^{2} \{(N-n)(k-1) - kn(t-1)\}
+ \mu(K-\mu) \{(K-k)(N-n) + kn(t-1)\} \Big],
\tag{11.12.3a}
\]

where
σ² refers to the variance of test scores,
σ_p² to the variance of the item difficulty indices, and
μ̂ to the estimate of the mean test score obtained from multiple
matrix sampling.
Equation (11.12.3a) is appropriate given (a) items and examinees
are sampled randomly and without replacement, and (b) tk is less
than or equal to K. Of course, SE(μ̂) = √VAR(μ̂). Making predictions
from (11.12.3a) is facilitated by modifying the equation for
the case where N is countably infinite. Doing this produces

\[
\mathrm{VAR}(\hat{\mu}) = \frac{1}{tkn(K-1)}
\Big[ K^{2} \sigma_p^{2} \{(K-k)(n-1) - kn(t-1)\}
+ K \sigma^{2} (k-1) + \mu(K-\mu)(K-k) \Big].
\tag{11.12.3b}
\]


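The expected change in SE(μ̂) given changes in the values of the
parameters defining the normative distribution is, for a given sampling
plan, determined easily from the ratio (using 11.12.3b)

\[
\frac{\mathrm{VAR}(\hat{\mu})'}{\mathrm{VAR}(\hat{\mu})}
= \frac{K-1}{BK-1} \cdot
\frac{B^{2}K^{2} \sigma_p'^{2} \{(BK-k)(n-1) - kn(t-1)\}
 + ABK\sigma^{2}(k-1) + \dfrac{B^{2}K^{2}(s'-1)}{s'^{2}}(BK-k)}
{K^{2} \sigma_p^{2} \{(K-k)(n-1) - kn(t-1)\}
 + K\sigma^{2}(k-1) + \dfrac{K^{2}(s-1)}{s^{2}}(K-k)},
\tag{1}
\]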


where

VAR(μ̂)' denotes VAR(μ̂) for a normative distribution having
the primed parameter values,

VAR(μ̂) denotes VAR(μ̂) for a normative distribution having
the unprimed parameter values,

s refers to the skewness parameter of the normative
distribution under consideration (because μ̂ is unbiased
in multiple matrix sampling, the expected value of μ̂
is K/s where, as is frequently the case, s = 2 for normal
normative distributions and 1 < s < 2 for negatively
skewed normative distributions),

B = K'/K,
and
A = σ'²/σ².


For example, if the same sampling plan were to be used for two
tests having lengths K' and K, with K less than K' and identical
values for the parameters α₂₀, σ_p², and degree of skewness in the
normative distribution, it is expected from (1) that SE(μ̂)' will
exceed SE(μ̂). In this case, B and A are greater than unity, s' = s,
σ_p'² = σ_p², and α₂₀' = α₂₀, and the ratio defined by (1) will exceed
unity.

For a sampling plan having tk less than or equal to K, the following
changes in SE(μ̂) with changes in parameters defining the normative
distribution proceed directly from (1):

1. For a given degree of skewness in the normative distribution,
σ_p², and α₂₀, an increase in K results in an increase in SE(μ̂).

2. For a given degree of skewness in the normative distribution,
K and α₂₀, an increase in σ_p² results in an increase in SE(μ̂)
when tk is less than K; when tk = K, SE(μ̂) decreases as σ_p²
increases.

3. For a given degree of skewness, K and σ_p², SE(μ̂) increases
as α₂₀ increases.

4. For a given K, α₂₀ and σ_p², SE(μ̂) decreases as the degree of
skewness in the normative distribution increases.

In the majority of multiple matrix sampling investigations σ²
and σ_p² are unknown and must be estimated before VAR(μ̂) defined
by 11.12.3a or 11.12.3b can be computed. If all parameters are
known, SE(μ̂) is computed exactly; if any parameters are estimated,
the obtained value is an approximation.
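
As an illustration only, the two variance expressions can be coded directly. The sketch below assumes the reading of (11.12.3a) and (11.12.3b) in which σ_p² is on the proportion scale (so the item-difficulty term carries a factor K²) and σ² and μ are on the test-score scale; the function name and argument names are hypothetical and are not taken from the original program.

```python
import math


def var_mean_estimate(K, t, k, n, sigma2, sigma_p2, mu, N=None):
    """Variance of the estimated mean test score for a plan with tk <= K.

    Assumed form of equations (11.12.3a) / (11.12.3b):
      sigma2   - variance of test scores
      sigma_p2 - variance of item difficulty indices (proportion scale)
      mu       - mean test score
      N        - examinee population size; None means countably infinite
    """
    if t * k > K:
        raise ValueError("these equations apply only when tk <= K")
    item_term = K ** 2 * sigma_p2 * ((K - k) * (n - 1) - k * n * (t - 1))
    if N is None:
        # Infinite examinee population (assumed form of 11.12.3b)
        examinee_term = K * sigma2 * (k - 1)
        mixed_term = mu * (K - mu) * (K - k)
        return (item_term + examinee_term + mixed_term) / (t * k * n * (K - 1))
    # Finite examinee population (assumed form of 11.12.3a)
    examinee_term = K * sigma2 * ((N - n) * (k - 1) - k * n * (t - 1))
    mixed_term = mu * (K - mu) * ((K - k) * (N - n) + k * n * (t - 1))
    return (N * item_term + examinee_term + mixed_term) / (
        t * k * n * (K - 1) * (N - 1)
    )


def se_mean_estimate(*args, **kwargs):
    """Standard error: the square root of the variance above."""
    return math.sqrt(var_mean_estimate(*args, **kwargs))
```

Two limiting cases give a quick check on this reading: testing every examinee on every item (t = 1, k = K, n = N) yields a variance of zero, and ordinary examinee sampling on the full test (t = 1, k = K) reduces to σ²(N − n)/[n(N − 1)].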

Equations (11.12.3a), (11.12.3b), and (1) are applicable for
sampling plans having tk less than or equal to K. There are occasions,
however, when a researcher will construct subtests such that tk is
greater than K. For example, parameters for a 60-item test may be
estimated by constructing 10 subtests containing six items each
with each subtest administered to a class of examinees. However,
the allotted testing time per class may be twice the time required
to complete the 6-item test. Under these circumstances, the researcher
would be wise to increase the subtest length from six to
12 items to take advantage of the available testing time. With the
number of observations being doubled, SE(μ̂) and SE(σ̂) for the
k = 12 sampling plan are expected to be markedly less than the
corresponding standard errors of estimate for the k = 6 sampling
plan. The point to be made is that equations (11.12.3a), (11.12.3b),
and (1) are appropriate for the k = 6 plan but inappropriate for
the k = 12 plan. It should be noted that, if a sampling plan is selected
such that tk is greater than K, tk should be an integer multiple of K
and items should be selected subject to the restriction that across
subtests each item appears an equal number of times.
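
One simple way to satisfy this restriction is sketched below. It is not the procedure used in the investigation: the function name is hypothetical, and dealing a single shuffled item order cyclically is only one of many balanced allocations (it does not sample uniformly from all of them).

```python
import random


def balanced_subtests(K, t, k, rng=None):
    """Assign K items to t subtests of k items so that every item
    appears exactly tk/K times and no subtest repeats an item."""
    if k > K:
        raise ValueError("a subtest cannot hold more than K distinct items")
    if (t * k) % K != 0:
        raise ValueError("tk must be an integer multiple of K")
    rng = rng or random.Random()
    order = list(range(K))
    rng.shuffle(order)  # random item order, then dealt cyclically
    return [
        [order[(j * k + pos) % K] for pos in range(k)]
        for j in range(t)
    ]
```

For the 60-item example above, `balanced_subtests(60, 10, 12)` returns ten 12-item subtests in which every item appears twice.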


Standard Error Of Estimate For σ̂


No equation comparable to (11.12.3a) exists in the literature for
computing SE(σ̂) under multiple matrix sampling. Indeed, the
task would not be a casual undertaking. The standard error of
estimate for σ̂ is approximated readily, however, through post
mortem item-examinee sampling: given a normative distribution,
various item-examinee samples are selected randomly from this
data base and used to estimate parameters of the distribution from
which they have been sampled. If SE(σ̂) for a particular sampling
plan were to be approximated through post mortem sampling, the
sampling plan would be applied repeatedly to the data base with
each of the r replications producing one estimate of σ. SE(σ̂) is the
standard deviation of the r estimates of σ. It is expected that this
value approximates more closely the true value of SE(σ̂) as r increases.
Standard errors of estimate for any other parameter may be
approximated similarly.
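
The replication logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's program: it estimates only the mean (the pooled estimate of σ requires Lord's (1960) formulas, which are not reproduced here), it assumes tk ≤ K and tn ≤ N, and it uses the sample standard deviation (divisor r − 1), which the text does not specify. All names are hypothetical.

```python
import random
import statistics


def estimate_mean(data, t, k, n, rng):
    """One replication: draw t subtests of k items and t groups of n
    examinees (both without replacement) and return the estimated
    mean test score, K times the observed proportion correct."""
    N, K = len(data), len(data[0])
    items = rng.sample(range(K), t * k)
    people = rng.sample(range(N), t * n)
    correct = 0
    for j in range(t):
        sub_items = items[j * k:(j + 1) * k]
        sub_people = people[j * n:(j + 1) * n]
        correct += sum(data[a][i] for a in sub_people for i in sub_items)
    return K * correct / (t * k * n)


def post_mortem_se(data, t, k, n, r=25, seed=0):
    """Standard deviation of r replicated estimates of the mean."""
    rng = random.Random(seed)
    estimates = [estimate_mean(data, t, k, n, rng) for _ in range(r)]
    return statistics.stdev(estimates)
```

Here `data` is a list of N examinee rows, each a list of K item scores coded 0 or 1.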


Method 


The research design was one of post mortem item-examinee
sampling with the required data bases generated through a computer
simulation model. Post mortem sampling was selected because
equations were not available for computing SE(σ̂) exactly or for
computing SE(μ̂) when tk exceeds K. A simulation model was used
primarily because of the anticipated difficulty in locating existing
data bases having among themselves the prerequisite variations
in parameters. The simulation model permitted a degree of flexibility
in parameter manipulation not found in previous post mortem
item-examinee sampling investigations.

The simulation model¹ generates dichotomously-scored item
scores and test scores having prescribed characteristics. Subject
to manipulation within the model are the mean test score μ,
variance of test scores σ², test reliability α₂₀, mean item difficulty
index, variance of item difficulty indices σ_p², and degree of skewness
in the normative distribution of test scores. It is a well-known
fact, and one attended to within the model, that these
parameters are not mutually independent.

¹ A listing and expanded writeup of the computer program implementing
the simulation model is available upon request from the author. The assistance
of Leonard L. Streeter (Graduate School of Education, University of California
at Los Angeles) in writing portions of the program is acknowledged and
appreciated.

All normative distributions generated were for a 40-item test
(K = 40). Parameters manipulated systematically were (a) the
test reliability (α₂₀ = .80, .95), (b) the variance of item difficulty
indices (σ_p² = .00, .05), and (c) the degree of skewness in the
normative distribution (distributed normally, markedly negatively
skewed). For all negatively skewed normative distributions,
the mean test score was at 90 per cent of items answered
correctly and only σ_p² = .00 was used. Seven item-examinee
sampling plans, listed in the first column of Table 1, were used with
each of the six normative distributions. Additionally, the selection
of item-examinee sampling plans permitted a detailed
examination of the effects of variations in t, k, and n on the
respective standard errors of estimate. Specifically, in procedures
1, 2 and 3, number of subtests t was varied with k and n held
constant; in 1, 4 and 5, number of items per subtest k was
varied with t and n constant; and, in 1, 6 and 7, number of
examinees per subtest n was varied with t and k constant. The
results of each sampling plan were replicated 25 times. The
parameters estimated were the population mean test score μ and
the standard deviation σ of the population test scores.
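
The dependence among these parameters can be made concrete. The sketch below assumes that α₂₀ denotes Kuder-Richardson formula 20, so that fixing K, the mean item difficulty, σ_p² and α₂₀ determines σ; the function name is hypothetical. Under that assumption it reproduces the values of σ listed at the foot of Tables 1 and 2.

```python
import math


def score_sd_from_kr20(K, mean_p, sigma_p2, alpha20):
    """Test-score standard deviation implied by KR-20.

    KR-20: alpha = K/(K-1) * (1 - sum(p_i * q_i) / sigma^2), where
    sum(p_i * q_i) = K * (mean_p * (1 - mean_p) - sigma_p2).
    """
    sum_pq = K * (mean_p * (1 - mean_p) - sigma_p2)
    variance = sum_pq / (1 - alpha20 * (K - 1) / K)
    return math.sqrt(variance)


# The six normative distributions of the study (K = 40)
for mean_p, sigma_p2, alpha20 in [
    (0.5, 0.00, 0.80), (0.5, 0.05, 0.80),
    (0.5, 0.00, 0.95), (0.5, 0.05, 0.95),
    (0.9, 0.00, 0.80), (0.9, 0.00, 0.95),
]:
    sd = score_sd_from_kr20(40, mean_p, sigma_p2, alpha20)
    print(mean_p, sigma_p2, alpha20, round(sd, 3))
```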


Results 


All results are recorded in Tables 1 and 2. In the first sampling
plan, for example, four subtests containing five items each were
constructed with each subtest administered to 30 examinees. Each
replication of this sampling plan using a normal normative distribution
with α₂₀ = .80, σ_p² = .00, μ = 20 and σ = 6.742 produced
one pooled estimate of μ and one pooled estimate of σ. The standard
deviation of the 25 estimates of μ (or, the standard error of estimate
for μ̂) was .697 (Table 1); the corresponding value for σ̂ was 2.446
(Table 2). The remaining values in Tables 1 and 2 are interpreted
similarly.

The standard errors for the population mean test score were
computed to determine the validity of the simulation model and the
effect of the number of replications on the results. For those sampling
plans with tk being less than or equal to K, the results in Table 1
conform generally to those expected from an analysis of the ratio
defined by (1). There is one discrepancy, however, which should
be noted: for a given α₂₀, K and degree of skewness, an increase in
σ_p² is expected to produce a decrease in SE(μ̂) when tk = K. As
can be seen from Table 1 (sampling plans 2 and 4), this expected
decrease did not occur consistently. Although the standard errors
are similar and the expected decrease is slight, a discrepancy such
as this provides a measure of the price paid in using only 25 replications
of each sampling plan.


TABLE 1

Standard Errors of Estimate for μ̂ for a 40-Item Test as a Function of Variations in
σ_p², α₂₀ and Degree of Skewness in the Normative Distribution

                   Normal Normative Distribution              Neg.-Skewed Dist.
                   α₂₀ = .80            α₂₀ = .95             α₂₀ = .80  α₂₀ = .95
Sampling Plan
(t/k/n)            σ_p² = 0  σ_p² = .05  σ_p² = 0  σ_p² = .05  σ_p² = 0   σ_p² = 0

1. (04/05/030)       .697     1.818      1.376     1.872       .591       .650
2. (08/05/030)       .519      .566       .859      .935       .284       .434
3. (16/05/030)       .374     1.294       .573     1.076       .326       .372
4. (04/10/030)       .589      .482       .805      .878       .395       .554
5. (04/20/030)       .631      .892      1.139     1.157       .297       .543
6. (04/05/060)       .665     1.539       .872     1.353       .405       .472
7. (04/05/120)       .944     1.445       .578     1.511       .203       .984

μ                  20.000    20.000     20.000    20.000     36.000     36.000
σ                   6.742     6.030     11.644    10.415      4.045      6.987

Considering those sampling plans where tk is less than or
equal to K, the results support the general conclusion that, over
the range of observations sampled, the greater the number of
observations used, the smaller the standard error of estimating
parameters. (That this was not the case when tk exceeded K is
the subject of later discussion.) Regarding variations in t, k, and n,
it was generally the case that for the sampling plans considered
variations in n were least effective in reducing standard errors of
estimate. For the normal normative distributions, variations in k
were most effective in reducing standard errors of estimate; for
negatively skewed normative distributions, variations in t were
most effective.


TABLE 2

Standard Errors of Estimate for σ̂ for a 40-Item Test as a Function of Variations in
σ_p², α₂₀ and Degree of Skewness in the Normative Distribution

                   Normal Normative Distribution              Neg.-Skewed Dist.
                   α₂₀ = .80            α₂₀ = .95             α₂₀ = .80  α₂₀ = .95
Sampling Plan
(t/k/n)            σ_p² = 0  σ_p² = .05  σ_p² = 0  σ_p² = .05  σ_p² = 0   σ_p² = 0

1. (04/05/030)      2.446     1.776      1.331     1.189      1.423      1.290
2. (08/05/030)      1.621      .974      1.021      .696       .763       .740
3. (16/05/030)      1.282      .643       .546      .461       .443       .621
4. (04/10/030)      1.254      .482       .740      .698       .838       .953
5. (04/20/030)       .575      .892       .653      .652       .876      1.083
6. (04/05/060)      1.946     1.539      1.077      .915      1.052      1.112
7. (04/05/120)       .086      .883       .934      .595       .825       .678

μ                  20.000    20.000     20.000    20.000     36.000     36.000
σ                   6.742     6.030     11.644    10.415      4.045      6.987

Of major interest was the change in standard errors of estimate
due to variations in σ_p² and α₂₀. It was generally the case that, for
a given α₂₀, K and sampling plan, as σ_p² increased, SE(μ̂) increased
and SE(σ̂) decreased; for a given σ_p², K and sampling plan, as α₂₀
increased, SE(μ̂) increased and SE(σ̂) decreased (although the
decrease in SE(σ̂) was found less consistently when the normative
distribution was negatively skewed).

Discussion 

When tk is less than or equal to K, the relative magnitude of the
standard errors of estimate for μ̂ for the seven sampling plans over
the six normative distributions considered herein corresponds generally
to that expected from an analysis of the ratio defined by (1). This
result lends credibility to the standard errors of estimate for σ̂
determined through use of the simulation model. Of major concern
is the result that, when tk exceeded K, the standard errors of estimate
were greater than those observed when tk = K, even though the
number of observations acquired by the sampling plan had been
doubled. This is undoubtedly a reflection of the manner in which
subtests were constructed. Specifically, in this investigation items
were assigned randomly to each subtest: the sampling of items was
without replacement for each subtest but with replacement among
subtests. As such, it is unlikely that each item appeared an equal
number of times among subtests. Results such as these underscore
the fact that, when tk exceeds K, (a) tk should be an integer multiple
of K, and (b) subtest items should be selected randomly but subject
to the restriction that each item appear tk/K times among the t
subtests. Failure to do this results in a marked increase in the standard
error of estimate.
As is demonstrated by the results, α₂₀ and σ_p² are factors of primary
importance in multiple matrix sampling. Specifically, for a given
α₂₀, K and degree of skewness in the normative distribution, an
increase in σ_p² is accompanied by an increase in SE(μ̂) and a decrease
in SE(σ̂); for a given σ_p², K and degree of skewness, an increase
in α₂₀ is accompanied by an increase in SE(μ̂) and a decrease in
SE(σ̂). The implications of these results for researchers attempting
to implement multiple matrix sampling are (a) a priori estimates
of σ_p² and α₂₀ are important in selecting a sampling plan, (b) a larger
number of observations is required to estimate μ with a prescribed
degree of accuracy when α₂₀ is high as compared to α₂₀ being low,
with the opposite the case for σ, (c) a larger number of observations
is required to estimate μ with a prescribed degree of accuracy when
σ_p² is high as compared to low, with the opposite, once again, the
case for σ, (d) a given sampling plan cannot be expected to have the
same standard error of estimate per parameter for different normative
distributions even though the test length K may be the same, and
(e) for a given standard error of estimate, fewer observations are
required to estimate parameters for a skewed distribution than for
a normal normative distribution.

It should be noted as a guideline that, in the selection of a
specific sampling plan given tk is less than or equal to K, (a)
variations in n are least effective in reducing standard errors of
estimate, and (b) for normal normative distributions, variations in
k are most effective with variations in t being most effective
for skewed normative distributions. Shoemaker (1971) reported
that variations in t were most effective in reducing standard
errors of estimate in multiple matrix sampling for both normal
and negatively skewed normative distributions. In this prior
investigation, however, tk exceeded K for all sampling plans,
tk was not an integer multiple of K, and items were not distributed
with equal frequency across subtests. It is hypothesized that,
had the assignment of items to subtests been controlled more
rigorously, variations in k may have been more effective in
reducing standard errors of estimate for normal normative
distributions and more in line with the results reported herein.

NOMOGRAPHS FOR THE SIGNIFICANCE OF THE
DIFFERENCE BETWEEN PERCENTAGES¹


M. REEB 
Bar-Ilan University, Israel 


Lawshe and Baker (1950) present a nomograph for the significance
of the difference between uncorrelated percentages, based on
the “development of a statistic which is a function of p and which
has a constant standard error dependent only on the size of the
sample.” This nomograph, Figure 1 in the present paper,² is convenient
to use in that it provides, directly from p₁ and p₂, values
for the statistic ω (omega) for substitution in

\[
t = \omega \sqrt{\frac{N_1 N_2}{N_1 + N_2}}, \tag{1}
\]

the calculation being completed arithmetically.³ The first part of
this paper suggests the use of a companion nomograph providing
directly that part of (1) which is a function of N₁ and N₂. This
function, defined as ν (nu), gives rise to

\[
t = \omega\nu, \tag{2}
\]

which is more easily calculated than (1).

The nomograph is based on the geometric property that in a
right-angled triangle of lesser sides N₁ and N₂, if a line be drawn
bisecting the right angle to meet the hypotenuse and its length is
d, then

\[
d = \frac{\sqrt{2}\, N_1 N_2}{N_1 + N_2}. \tag{3}
\]

ν having been defined as √(N₁N₂/(N₁ + N₂)), (3) gives

\[
\nu = \sqrt{d / \sqrt{2}}. \tag{4}
\]


¹ The publication of this material has been assisted by a grant from the
Research Fund of Bar-Ilan University.

² Thanks are expressed to Prof. C. H. Lawshe for permission to use the
nomograph. In order to increase accuracy, each .1 division on the ω scale has
been sub-divided into 5 parts, each of .02.

³ Appel (1952) has presented a set of nomographs to derive t for the same
case. Almost no computation at all is required, but their use involves more
steps and is more complex than in the present procedure.


A nomograph can now be drawn with N₁ and N₂ as intercepts on
the vertical and horizontal axes respectively, and the required value
of ν will be a simple function of d, the length of the intercept on the
diagonal bisecting the right angle.⁴ In practice, the ν scale in Figure 2
was calibrated by taking the special case N₁ = N₂ = N for which
ν = √(N/2) and d = N/√2.


Use of the Nomographs to Obtain t 


Set a straight-edge, preferably transparent, in Figure 1 at the
appropriate values of p₁ and p₂, and read ω. Similarly, in Figure 2,
from the values of N₁ and N₂ read ν; the upper sides of the N₁, N₂
and ν scales are calibrated for N₁, N₂ = 0-100, and the lower
for N₁, N₂ = 0-1000. (These two sets of scales are to be used
separately, i.e., values of N less than 100 from one set and greater
than 100 from the other may not be used together.) Multiply ω by ν
to obtain t. If ω is very small, or t is close to the required significance
level (either below or above it), it is probably better to
calculate t by the usual formula.
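
For readers who prefer to check a nomograph reading numerically, the quantities involved are easy to compute. The sketch below assumes equations (1) to (4) in the forms ν = √(N₁N₂/(N₁ + N₂)), d = √2·N₁N₂/(N₁ + N₂) and t = ων; the function names are hypothetical, and ω itself is taken as given (read from Figure 1), since its definition is not reproduced in this paper.

```python
import math


def nu(n1, n2):
    """The function of the two sample sizes that multiplies omega."""
    return math.sqrt(n1 * n2 / (n1 + n2))


def bisector_length(n1, n2):
    """Length d of the right-angle bisector in a right triangle with
    legs n1 and n2 (the intercept on the nomograph diagonal)."""
    return math.sqrt(2) * n1 * n2 / (n1 + n2)


def t_from_omega(omega, n1, n2):
    """t = omega * nu, with omega read from Figure 1."""
    return omega * nu(n1, n2)
```

In the calibration case N₁ = N₂ = N these reduce to ν = √(N/2) and d = N/√2, for example ν = 5 and d of about 35.36 when N = 50.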


Nomographs for Critical Omegas 


A possible extension of this device is suggested by two circumstances.
Firstly, as df's increase, the values of t required for significance
at given probability levels tend rapidly towards asymptotes,
as shown by the typical values in Table 1.

Consequently, where t changes very little, it becomes possible
with small loss to calibrate the diagonal in Figure 2 not only as
previously in ν, but directly in critical minimum values of ω, given
by

\[
\omega_{\min} = \frac{t}{\nu}, \tag{5}
\]

t being taken at its highest value for a selected range of df's at a
given significance level.


⁴ One of the steps in the Appel (op cit) procedure involves the derivation
of a quantity defined as 1/N₁ + 1/N₂, from scales for N₁ and N₂; this might have been
modified to derive ν as defined above. However all these scales narrow very
rapidly as N increases, and it is thought that Figure 2 of this paper is
preferable.


[Figure 1. Nomograph giving ω from p₁ and p₂.]


TABLE 1
Values of t Significant (two-tailed) at .05 and .01 for Various df's

df's                            25      50     100     400    1000

t required for significance
  p .05                      2.060   2.008   1.981   1.966   1.962
  p .01                      2.787   2.678   2.626   2.588   2.581

Secondly, even when t changes more rapidly, advantage can be
taken of the fact that in using Figure 2, some information on the
number of cases, N₁ + N₂, and therefore on the number of df's,
N₁ + N₂ − 2, for a given comparison has become available. The
intercept on the diagonal corresponds to a certain minimum number
of df's, represented by the case N₁ = N₂ at that particular intercept.
Consequently again as in (5), but separately for each point on the
diagonal, critical minimum values of ω, corresponding to the appropriate
minimum value of t required for significance divided by the
fixed value of ν at that point, can be calculated. Some loss of
power is entailed by this procedure, growing less as N₁ and N₂ tend
to equality and as df's increase, in accordance with, for example, the
values of t in Table 1.


Evaluation of Loss of Power 


The loss of df's due to the inequality of N₁ and N₂ can be evaluated
in the following way. For a given intercept on the diagonal the
minimum sample size corresponds to the case N₁ = N₂ = d/cos 45° =
√2 d, d being defined as previously. Substituting in this the value
of d given by (3) above, and doubling, the effective minimum sum
of N₁ and N₂ is, therefore, given by 4N₁N₂/(N₁ + N₂). Since the
actual sum is, of course, N₁ + N₂, the relative utilization of df's
by this procedure is 4N₁N₂/(N₁ + N₂)² (neglecting the 2 degrees
of freedom lost in both numerator and denominator). Denoting
N₁/N₂ by r, we have this conveniently expressed by 4r/(1 + r)²,
which is evaluated for various values of r in Table 2.

TABLE 2
Relative Utilization of df's for Various Values of r

r                     1      2      3      4      6      8     10

Rel. Utilization   1.00    .89    .75    .64    .49    .40    .33

Bearing in mind the nature of the variation in t, it emerges that
critical values of ω are not very seriously affected even for small and
quite unequal numbers of cases in the two samples. For example,
for N₁ = 10, N₂ = 40, there are 48 df's, which for p = .05 requires
a t of 2.010. For the intercept on the diagonal, the number of effective
df's is given by 4N₁N₂/(N₁ + N₂) − 2 = 30, for which the required
t is 2.042. From (5), and ν being uniquely determined at each point
because of the geometrical properties of the nomographs, the required
minimum value of ω (or t) has increased by .032/2.010 or
1.6% only. For .01 significance, the increase is (2.750 − 2.681)/2.681
or 2.6%. Similarly for Ns of 20 and 100, ω_min increases by .8% and
1.1% for .05 and .01 significance respectively, for 50 and 150 by
.2% and .3%, and for 200 and 500 by about .1%.
It must be emphasized that the error introduced is always of 
Type II, i.e., in the conservative direction, accepting the null hy- 
pothesis when it should be rejected. 
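
The utilization figures are straightforward to reproduce. The short sketch below, with hypothetical function names, evaluates 4r/(1 + r)² for the values of r in Table 2 and the effective df's 4N₁N₂/(N₁ + N₂) − 2 used in the worked example.

```python
def relative_utilization(r):
    """Share of the available df's used when N1/N2 = r."""
    return 4 * r / (1 + r) ** 2


def effective_df(n1, n2):
    """df's implied by the intercept on the diagonal."""
    return 4 * n1 * n2 / (n1 + n2) - 2


for r in (1, 2, 3, 4, 6, 8, 10):
    print(r, round(relative_utilization(r), 2))

print(effective_df(10, 40))  # the N1 = 10, N2 = 40 example
```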


Calibration

As before, the nomographs were calibrated by using the special
case of N₁ = N₂ = N for which ν = √(N/2) and d = N/√2.
Consequently (5) becomes

\[
\omega_{\min} = \frac{t}{\sqrt{N/2}}, \tag{6}
\]

where t takes the appropriate value for 2N − 2 degrees of freedom.
Values of t for p = .05 and .01 (two-tailed) were taken from
Table V in Edwards (1968) (which is rather more detailed than
most such tables) to three decimal places for N = 2-200, corresponding
to 2-398 df's, with graphical interpolation where
necessary. For N more than 200, the variation in t does not discernibly
affect the calibration, and appropriate maximum t's were,
therefore, taken as applicable for the rest of the range of the
nomographs, up to N = 1000. For convenience, two nomographs were
drawn, separately for ranges of N from 0-100 (Figure 3) and
0-1000 (Figure 4).

Use of the Critical Omega Nomographs


Use Figure 1 to determine, as previously, the value of ω for p₁
and p₂. If N₁ and N₂ are less than 100, use Figure 3, if less than
1000 Figure 4, to determine the critical minimum value ω_min
required for significance at p = .05 or .01 (two-tailed): it is given
by the intercept on the diagonal of the line connecting the appropriate
values of N₁ and N₂. If ω exceeds ω_min, the difference in
percentages is significant at the given level. If ω is not much different
from ω_min, it is probably better to use Figure 2 and obtain t.

THE DISTRIBUTION OF TEST SCORES¹


WILLIAM A. SCOTT 
University of Colorado 


For a test composed of dichotomous items, it is well known that 
the distribution of total scores depends on the number of items, 
their difficulties, and their intercorrelations (see, for example, 
Nunnally, 1967). There are differing points of view concerning the 
ideal shape for a score distribution, some preferring a normal dis- 
tribution (Cronbach, 1960, р. 135), some a rectangular distribution 
(Ferguson, 1949; Guilford, 1954, p. 360; Humphreys, 1956), others 
a distribution that accords with true scores (Lord and Novick, 
1968). Without taking a position on this matter, one may still in- 
quire as to the conditions under which any desired distribution 
may be obtained. 

A normal distribution may be achieved most directly by writing 
uncorrelated items of equal difficulty. However, a zero interitem 
correlation would imply that the items did not measure a common 
trait, thus making a summative scoring procedure illogical. In ac- 
tual practice, normally distributed scores are rare; commonly the 
distribution is flatter than normal. 

The distribution of test scores may be made to approach that of 
а dichotomous criterion by writing items with somewhat less ex- 
treme p values than the criterion. The more valid the items, the 
closer should their ps approach that of the criterion (see Lord, 
1952, 1953). 

There is presently no consensus concerning the best method of
obtaining a rectangular score distribution. Nunnally (1967, p. 145)
suggests using items with intercorrelations around .40; Guilford
(1954, pp. 360-361) suggests intercorrelations around .50; Humphreys
(1956) specifies r = 1/3 and p = .50; Guttman (1944,
1950) and Loevinger (1947, 1948) have proposed writing items, of
widely varying difficulty, that are intercorrelated as highly as
possible.


¹ The empirical analyses reported in this paper were supported by the
University of Colorado's Council on Research and Creative Work. I am indebted
to Constance Farson, Dorothy Monk, and Ruth Scott for their contributions
to these analyses.

The problem with Guttman’s approach is that if the nontrait 
variance of items is due only to differing ps, this is likely to result 
in scales of rather trivial and redundant content. If the content of 
the items is allowed to vary much, along with their ps, their inter- 
correlations are likely to fall substantially below an optimal level, 
so that the score distribution is not rectangular and considerable 
error of measurement occurs at the extremes. 

Neither Nunnally nor Guilford indicates what level of p is assumed;
presumably this would be about .50, but whether it should
be constant or varying over the items is not stated. Actually, with
constant item ps of .50, the optimal interitem correlation for yielding
a rectangular score distribution is neither .50 nor .40, but .33,
as indicated by Humphreys (1956) and by Formula 3, below.

A rectangular distribution is one in which every score is obtained
by the same proportion of subjects. The more unequal the proportions
across scores, the less is the discrimination among subjects.
If Pᵢ stands for the proportion of subjects who obtain score i, then
one index of score discrimination may be constructed as:

\[
D = \frac{1}{\sum_{i=0}^{n} P_i^{2}}. \tag{1}
\]

D ranges from 1 to n + 1 (where n is the number of dichotomous
items in the test), and may be interpreted as the number of
items-worth of discrimination yielded by the test.

When item difficulty (p) is constant at any level, and the item
intercorrelation (r) is held constant at 0.00, ₙPᵢ, the proportion of
subjects scoring i out of n, is given by the familiar binomial
expansion:

\[
{}_{n}P_{i} = \frac{n!}{i!\,(n-i)!}\; p^{i} q^{n-i}, \tag{2}
\]

where q = 1 − p.
The top row of Table 1 reports the index of discrimination, D,
calculated from a binomial distribution with varying test lengths
and varying item ps. It is apparent that D is substantially smaller
than its maximum (n + 1), especially for long tests.
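
The top row of Table 1 can be recomputed in a few lines. The sketch below assumes that Formula 1 is D = 1/ΣPᵢ², which is consistent with the stated range of 1 to n + 1; the function names are hypothetical.

```python
import math


def binomial_proportions(n, p):
    """Formula 2: proportion of subjects scoring i out of n items of
    equal difficulty p when the items are uncorrelated."""
    q = 1 - p
    return [math.comb(n, i) * p ** i * q ** (n - i) for i in range(n + 1)]


def discrimination_index(proportions):
    """Assumed Formula 1: reciprocal of the sum of squared proportions."""
    return 1 / sum(P * P for P in proportions)


print(round(discrimination_index(binomial_proportions(10, 0.5)), 2))
print(round(discrimination_index(binomial_proportions(20, 0.5)), 2))
```

For p = .50 this gives 5.68 for n = 10 and 7.98 for n = 20, the values in the first row of Table 1, while a perfectly rectangular distribution over n + 1 scores gives D = n + 1.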

When the (constant) inter-item correlation is not equal to zero,
the proportion of subjects obtaining any particular score must be
calculated from a more complicated formula:²


^ (@+а+{-14+р+п—4-1) 
Ри Е 
i! (n — 9! 
ee (@+4-1,1+>-1) 
The second factor on the right represents the probability of passing
a particular combination of i items out of a total n. It is a ratio of


two beta functions, each of which may be expressed as a ratio of 
gamma functions: 


(3) 


еее ьЯ рат 1] 
сено он 
d rl +n - 1) 


TABLE 1

Indices of Dispersion (D) for Tests of n Items With Constant Difficulty (p) and
Constant Interitem Correlation (r)

n:            10                     20                     40
p:     .50    .40    .30      .50    .40    .30      .50    .40    .30
             (.60)  (.70)           (.60)  (.70)           (.60)  (.70)
r:
.00    5.68   5.55   5.18     7.98   7.81   7.29    11.25  11.02  10.29
.10    8.03   7.79   7.01    13.98  13.57  12.27    25.68  24.93  22.56
.20    9.86   9.37   7.82    18.26  17.37  14.51    34.96  33.29  27.84
.30   10.92   9.99   7.53    20.79  18.90  13.70    40.53  36.71  25.80
.40   10.67   9.36   6.50    20.10  17.12  10.85    38.81  32.12  18.54
.50    9.09   7.82   5.30    15.88  13.01   7.88    28.35  21.98  11.83
.60    6.98   6.04   4.21    10.75   8.89   5.62    16.71  13.12   7.50

. 00 . 6,98 6.06 4.31. 10.76 | 8.88 = n — 


photocopies. Make checks payable to: CCMIC-NAPS. I am indebted to 
Professor Hyman Kestleman for assistance in this derivation. 
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1) 


In expanded form, Formula 3 may be written as: 


{}_nP_i = \frac{n!}{i!\,(n-i)!} \cdot \frac{\prod_{a=-1}^{i-2}(p + qr + ar)\;\prod_{a=-1}^{n-i-2}(q + pr + ar)}{\prod_{a=-1}^{n-2}(1 + ar)} 

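Example 


If p = .50 and r = 1/3, then the probability of passing 2 out of 3 
items may be calculated as: 


{}_3P_2 = \frac{3!}{2!\,1!} \cdot \frac{\prod_{a=-1}^{0}\left(.50 + \frac{.50}{3} + \frac{a}{3}\right)\;\prod_{a=-1}^{-1}\left(.50 + \frac{.50}{3} + \frac{a}{3}\right)}{\prod_{a=-1}^{1}\left(1 + \frac{a}{3}\right)} 

        = \frac{(3)\,\{.50 + (.50)/3 - 1/3\}\,\{.50 + (.50)/3\}\,\{.50 + (.50)/3 - 1/3\}}{(1 - 1/3)\,(1 + 1/3)} 

        = .25 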
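
The worked example can be checked numerically. The sketch below is an illustration rather than the author's program: it evaluates the beta-function form of Formula 3 through log-gamma functions, with the parameters p/r + q - 1 and q/r + p - 1, and it falls back on Formula 2 when r = 0 and on the two-point distribution when r = 1. The function names are hypothetical. With p = .50 and r = 1/3 every score is equally likely, so the probability of passing 2 of 3 items is .25 and D reaches its maximum of n + 1.

```python
from math import comb, exp, lgamma


def log_beta(x, y):
    """Natural log of the beta function B(x, y), via log-gamma."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def score_proportions(n, p, r):
    """Proportion of subjects scoring i out of n, for i = 0..n, assuming
    constant item difficulty p and constant interitem correlation r."""
    q = 1.0 - p
    if r == 0.0:
        # Uncorrelated items: binomial expansion (Formula 2).
        return [comb(n, i) * p**i * q**(n - i) for i in range(n + 1)]
    if r == 1.0:
        # Perfectly correlated items: subjects pass all items or none.
        return [q] + [0.0] * (n - 1) + [p]
    a = p / r + q - 1.0
    b = q / r + p - 1.0
    return [
        comb(n, i) * exp(log_beta(a + i, b + n - i) - log_beta(a, b))
        for i in range(n + 1)
    ]


def discrimination_index(proportions):
    """Index of discrimination D = 1 / sum(P_i ** 2) (Formula 1)."""
    return 1.0 / sum(x * x for x in proportions)


example = score_proportions(3, 0.50, 1.0 / 3.0)
print([round(x, 4) for x in example])

# One cell of Table 1 (n = 10, p = .50, r = .30) for comparison.
d_table_cell = discrimination_index(score_proportions(10, 0.50, 0.30))
print(round(d_table_cell, 2))
```

Working on the log scale keeps the gamma functions from overflowing for long tests.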
Formula 3 does not apply to the following cases: 


If i = 0, {}_nP_0 = \prod_{a=-1}^{n-2} \frac{q + pr + ar}{1 + ar}. 

If i = n, {}_nP_n = \prod_{a=-1}^{n-2} \frac{p + qr + ar}{1 + ar}. 

If r = 1.00, _nP_0 = q and _nP_n = p; all other _nP_i = 0. 
If r = 0.00, all _nP_i are computed from Formula 2. 
Indices of discrimination have been calculated for several repre- 
sentative values of n, p, and r. These are reported in Tables 1 and 
1a.³ Maximum discrimination for every n occurs when p = .50 and 


— 
³ Other indices of discrimination have also been computed. One of these, 
DF, adapted from Ferguson (1949) is: 


DF = 1 - \sum_{i=0}^{n} P_i^2 

Another is the well-known index of information, H (see Attneave, 1959): 


H = \sum_{i=0}^{n} P_i \log_2 \frac{1}{P_i} 




r = .33. As p departs from .50 (in either direction), the optimal 
value of r decreases. When r is optimal for p, a given increase in n 
will augment D maximally. But if r departs in either direction 
from its optimal value for a given p, the increment in discrimina- 
tion achieved from lengthening the test is reduced. 

For long tests, a small increment of r in the lower range (e.g., 
from .00 to .05) yields a much greater proportionate increment in 
D than the same increase in r would yield for a short test. This is 
mainly because the normal distribution (approximated by a long 
test with uncorrelated items) provides such poor discrimination 
among subjects. Equivalent increments in r at moderate ranges 
(e.g., from .20 to .80) yield fairly equivalent (proportionate) 
changes in D, for all values of n. Beyond the optimal level of r, the 
longer the test, the more will D decrease proportionately for a 
given increase in r. 

The typical psychological test aimed at a single personality 
characteristic (e.g., a particular ability, motive, or attitude) is un- 
likely to show a mean interitem r larger than .20—usually closer to 
.10. Therefore, improved discrimination among subjects over the 
entire range of test scores will ordinarily be achieved by increasing 
the homogeneity of the test. (This will not necessarily lead to 
an increase in test validity, for increased test homogeneity can 
often be achieved only by confounding content with irrelevant 
methods variance—for example, by including many items with 
very similar wording—or by defining the domain of content so nar- 
rowly that it does not correspond to any domain for which natural 
behavioral criteria can be defined. So the present paper should not 
be read as an indiscriminate recommendation for raising interitem 
correlations, or even for aiming at rectangular score distributions. 


These alternative indices yield distributions very similar to the distribution 
of D, when n is small (except that DF ranges between 0 and 1; H between 
0 and log_2 n). For n = 5, the product-moment correlations of D with DF 
and H are both .96; for n = 40, they are .82 and .87, respectively; for n = 100 
they are .75 and .83. All of the characteristics of Tables 1 and 1a described in 
the text hold for comparable tables constructed from DF and H, with the 
provision that increments in DF must be interpreted as proportions of the 
distance to its upper limit. D is preferred to these two alternative indices 
because its upper limit equals the test length (+1) and it may be interpreted 
intuitively as the number of items-worth of discrimination yielded by the 
test; or, more precisely, as 1 plus the number of dichotomous items that would 
be required to yield the same level of discrimination if all p = .50 and all 
interitem r = 1/3. 
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It merely indicates with some precision how overall inter-subject 
discrimination is likely to be affected by specific changes in p and 
r. Whether or not one wishes to make such changes will presumably 
depend on their implications for test validity and on the popula- 
tions to which the test is to be applied.) 

Empirical Distributions 

It is to be expected that departure from constant p and r would 
tend to depress the empirical discrimination index below its theo- 
retically expected value, except when the mean r exceeds its optimal 
value for the given level of p. In the latter case, an appro- 
priate distribution of item difficulties could increase the discrim- 
ination of the test. In order to ascertain the degree of correspon- 
dence between theoretical Ds and Ds actually obtained when p 
and r vary, a large number of item sets were constructed from data 
previously collected by the author and by Stuart Cook. These in- 
cluded responses of General Psychology students to the Allport- 
Vernon-Lindzey (1960) Study of Values, Edwards’ (1953) Personal 
Preference Schedule, and some value scales constructed by Scott 
(1965, pp. 249-257). These tests were administered in both single- 
stimulus and forced-choice format to three different samples of 
subjects (see Scott, 1968). Additionally, Scott’s value scales were 
administered to a sample of General Psychology students at 
the University of Colorado and to a sample of entering freshmen 
at Dakota State College (Madison, South Dakota). Professor 
Cook kindly provided data from preliminary versions of his Multi- 
factor Racial Attitude Inventory (see Woodmansee and Cook, 
1967). Altogether, there were nine different data pools, each con- 
sisting of between 100 and 400 subjects’ responses to between 120 
and 240 items. 

From these data pools various sets of items were selected in such a 
way as to: (a) minimize overlap between item sets (no two sets of 
the same size overlapped by more than 50% of their items); (b) 
maximize inter-set differences in mean p, mean interitem r, standard 
deviation of ps, and standard deviation of interitem rs. Thus, it was 
hoped to test the applicability of the theoretical distribution to a 
range of test lengths, under varying degrees of departure from the 
conditions of constant p and constant interitem r assumed in For- 
mula 3. The purpose of minimizing overlap between sets was to re- 
duce the degree to which the results depended on a particular 
sample of items with their peculiar distributions of ps and inter- 
item rs. 

The empirical investigation was limited to item sets of sizes 5, 
10, and 20 because larger sets would have required more than 400 
subjects to stabilize D, and there were very few large sets with a 
high mean r. Within these limitations, it is probable that the ranges 
of mean p, mean r, and intra-set standard deviations of p and r are 
considerably wider than one is likely to encounter in ordinary psy- 
chological inventories. 

For the 5-item sets, samples of at least 100 subjects were used; 
for the 10- and 20-item sets, samples contained at least 200 sub- 
jects. For every set of items, an empirical index of dispersion, D, 
was calculated from the actual distribution of subjects’ scores. In 
addition, a theoretical D was computed from the distribution of 
_nP_i calculated from Formula 3, under the assumption that p and 
r were constant at their mean values for the item set. The degree to 
which the theoretical Ds approximated the empirical Ds for sets of 
a given size (5, 10, or 20 items) was represented by the intraclass 
correlation, ρ (Fisher, 1941), between these two measures over all 
sets of that size. Unlike the product-moment correlation, ρ is re- 
duced by any discrepancies between means and standard devia- 
tions of the two distributions being correlated. It is a measure of the 
accuracy of prediction, rather than of the linear relation between 
predictor and criterion. 
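
The contrast between the two coefficients can be illustrated with a small sketch. It assumes the pairwise form of Fisher's intraclass correlation, in which both members of each pair are measured from a single pooled mean and scaled by a single pooled variance; whether the author used exactly this computation is an assumption, and the data are hypothetical. A constant shift between the two measures leaves the product-moment correlation at 1.00 but lowers the intraclass correlation.

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between paired lists x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def intraclass_r(x, y):
    """Pairwise intraclass correlation (assumed form): deviations of both
    measures are taken from one pooled mean and scaled by one pooled variance."""
    n = len(x)
    pooled_mean = (sum(x) + sum(y)) / (2 * n)
    pooled_var = (
        sum((a - pooled_mean) ** 2 for a in x)
        + sum((b - pooled_mean) ** 2 for b in y)
    ) / (2 * n)
    cross = sum((a - pooled_mean) * (b - pooled_mean) for a, b in zip(x, y))
    return cross / (n * pooled_var)


# Hypothetical theoretical and empirical Ds; the second list is the first
# shifted upward by a constant, so the linear relation is perfect.
theoretical = [1.0, 2.0, 3.0, 4.0, 5.0]
empirical_shifted = [v + 2.0 for v in theoretical]

print(round(pearson_r(theoretical, empirical_shifted), 3))
print(round(intraclass_r(theoretical, empirical_shifted), 3))
```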

The intraclass correlations, together with the numbers of sets 
from which they were computed, are presented in the first column 
of Table 2. These represent fairly good correspondence between 
theoretical and empirical indices of discrimination when all sets 
are considered (.81, .88, and .90 for sets of size 5, 10, and 20 items, 
respectively). When the standard deviations of ps and rs within a 
set are both limited to .15, the theoretical approximation is sub- 
stantially improved: ρ = .97, .98, and .96, respectively, for sets of 
5, 10, and 20 items. When sets of all three sizes are combined 
(thereby introducing set size, n, as a stratified variable), the in- 
traclass correlation between theoretical and empirical D becomes 
.97 for all sets and .99 for sets with σ_p and σ_r of .15 or less. 

These empirical results indicate that the assumptions of constant 
p and constant interitem r used in developing Formula 3 do not 
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TABLE 2 
Intraclass Correlations between Theoretical and Empirical Indices of Dispersion 


                           Variability of p and r 
n (set size)             Unrestricted       ≤ .15 

5            ρ =             .81             .97 
   (No. of sets =)       [illegible]     [illegible] 
10           ρ =             .88             .98 
   (No. of sets =)       [illegible]        (93) 
20           ρ =             .90             .96 
   (No. of sets =)       [illegible]        (47) 
Combined     ρ =             .97             .99 
   (No. of sets =)         (1071)          (437) 


limit its usefulness in predicting the discriminating power of typical 
tests in which item difficulties and interitem correlations vary 
moderately. 


Optimal r for Unselected Tests 


In 150 of the 1071 sets examined (14%), the empirically ob- 
tained D exceeded the theoretically expected value. All but nine of 
these were sets in which the mean r exceeded the theoretically 
optimal value for that set's mean p (ascertained from Table 1). 
(The nine exceptions were all sets of size 5 with mean r less than 
.15.) Considering only the 278 sets in which the mean r exceeded its 
optimal value, 141, or 51% of these, had empirical Ds that exceeded 
the corresponding theoretical D, based on the assumption of con- 
stant p and constant r. Thus one is led to conclude that allowing p 
to vary is not a dependable way of increasing the discriminating 
power of a test, even when interitem correlations are high. 

The close correspondence between predicted and obtained values 
of D constitutes tentative justification for setting the optimal in- 
teritem correlation as indicated in Table 1—at about .33 for tests 
with item difficulties around .50, for example. Some direct empirical 
evidence about optimal r was obtained in the present study. Figure 
1 shows the mean indices of dispersion (D) calculated from sets, of 
a given size and given level of mean r, which had a mean p between 
.30 and .70. Values along the abscissa indicate the midpoint of the 
range of mean rs represented by the plotted point. 

Though the lines for all three set sizes appear to reach maxima 
in roughly the same range, the optimal values cannot be specified 
precisely from these empirical functions. Significance tests com- 
Figure 1. Empirically obtained mean indices of discrimination (D) as a 
function of test length and mean interitem correlation (all mean item diffi- 
culties between .30 and .70). 
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puted on sets of size 5 indicated only that sets in the range with the 
midpoint, r = .48, showed significantly lower discrimination than 
sets in the range with midpoint .43. Also sets with r in the range of 
.23 were significantly less discriminating than those in the range of 
.28. Between .28 and .43 none of the mean Ds was significantly 
different from any other. 

For sets of 10 items, significance tests showed the optimal mean 
r to lie somewhere in the range represented by the interval mid- 
points .28 and .33, since mean Ds from these two groups were not 
distinguishable from each other, but each was significantly larger 
than the mean D calculated from sets in the immediately adjacent 
category (outside this range). 

For sets of 20 items, the range of mean r between .26 and .30 
yielded a significantly higher mean D than the 5-point interval be- 
low it, but the range of mean rs with midpoint .38 did not yield Ds 
significantly larger than any combination of sets with mean r above 
this range. (There were only 23 sets of size 20 with mean r above 
.35, so the mean D indicated for that range in Figure 1 should not 
be regarded as dependable.) 

For sets of size 20, the optimal value of the mean r was thus 
found to lie somewhere above .25; for sets of size 10, somewhere 
between .25 and .35; and for sets of size 5, somewhere between .25 
and .45. It seems reasonably clear from these empirical data that, 
if one aims to construct a test with maximal discrimination (i.e., a 
rectangular distribution of scores), he would aim neither at a mean 
interitem correlation between .10 and .20, which is typical in cur- 
rent psychological tests, nor at a mean interitem correlation of .50 
as suggested by at least one author. 

When the mean interitem r is above its theoretically optimal 
value for a given mean p, distributing item difficulties around their 
mean may increase the discriminating power of the test in some 
cases. But this was just as likely to yield reduced discrimination in 
the sample of tests investigated here, so the practice is not to be 
recommended on the basis of these results. 
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THE EFFECT OF SCORING INSTRUCTIONS AND 
DEGREE OF SPEEDEDNESS ON THE VALIDITY 
AND RELIABILITY OF 
MULTIPLE-CHOICE TESTS¹ 
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RONALD K. HAMBLETON 
In a recent study, Traub, Hambleton, and Singh (1969) com- 
pared the effects of two scoring instructions (one promising a small 
reward for omitted questions, the other threatening a small penalty 
for wrong answers) on performance on a multiple-choice vocab- 
ulary test. It was found that the reward instruction produced fewer 
incorrect answers, more omitted questions, and, with one qualifi- 
cation, higher reliability than the penalty instruction. The qualifi- 
cation was that the difference in reliability coefficients for the two 
instructions was significant at the .05 level only when the analysis 
was restricted to those examinees who indicated by their re- 
sponse to a posttest questionnaire that they correctly recalled the 
instruction they had been given for working the test. One purpose 
of the present investigation was to assess the generality of these 
findings using two tests, one of vocabulary, the other of mathe- 
matical reasoning. To make the study more comparable to previous 
research, a third instruction was also included. It informed the ex- 
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aminee that his score would be the number of correct answers, that 
he should try to answer every question and guess if necessary. 

This study had two additional objectives. The first was to com- 
pare the effect of different scoring instructions when the degree of 
speededness of the test administration was varied. This was accom- 
plished by administering the test with either a generous time limit 
or a restricted one. The second additional objective was to study the 
effect of different scoring instructions and different degrees of 
speededness on the criterion validity of test scores and their rela- 
tion to certain personality variables. This goal was achieved by 
correlating the test scores produced under each type of instruction 
and level of speededness with measures of school achievement, in- 
telligence, risk-taking, test anxiety and need for achievement. The 
results were compiled separately for each sex. 


A Theoretical Rationale and a Review 
of Some Related Research 


The theoretical rationale of this study is that guessing behavior 
on tests is affected by the type of scoring instructions employed 
and the degree of speededness of the administration, among other 
things. Now it is obvious that the smaller the amount of guessing 
that occurs on a test, the smaller the expected number of correct and 
incorrect answers and the larger the expected number of omitted 
questions. Reduced guessing should also result in increased reli- 
ability (Mattson, 1965) and increased criterion validity (Lord, 
1964). The foregoing argument provides a basis for choosing num- 
ber of correct and incorrect answers, number of omitted questions, 
reliability, and criterion-validity as indices by which to compare 
the effect of different scoring instructions and different degrees of 
speededness. 

A considerable amount of previous research has compared the 
effect of scoring instructions on mean test performance and reliabil- 
ity. However, the great bulk of this work has focussed on the pen- 
alty instruction and the instruction to guess. (For a summary of the 
results of such research, see Traub et al., 1969.) Only our work, 
summarized earlier, has dealt with the reward instruction in this 
context. 

A limitation of most previous research has been the almost total 
reliance of investigators on mean performance and reliability as the 
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indicators by which to compare the effect of scoring instructions. 
Very little attention has been paid to the effect of instructions on 
criterion validity. Ruch and De Graff (1926) found that scores on 
multiple-choice tests worked under the penalty instruction pre- 
dicted the criterion of performance on a free-response test of similar 
content better than scores on multiple-choice tests worked under the 
instruction to guess. More directly relevant are recent serendip- 
itous results reported by Sax and Collet (1968). They found that 
scores on the Henmon-Nelson Test of Mental Ability achieved 
under the reward instruction predicted the criterion of cumulative 
grade point average in college substantially better than scores on 
the same test achieved under the penalty instruction or the instruc- 
tion to guess. 

For the most part, previous investigators have also failed to pay 
attention to personality variables in their research on the effect of 
scoring instructions. This is true despite the fact that several per- 
sonality variables appear to correlate with test performance. For 
example, propensity for risk-taking on multiple-choice tests has 
been found to have small, but in some cases significant, positive 
correlations with scores on cognitive tests (Slakter, 1967; 1968).² 
Test anxiety, as it is measured by questionnaire, is frequently ob- 
served to have small, but significant, negative correlations with 
test scores (for a review of this research, see Ruebush, 1963 and 
Sarason, 1960). Objective measures of need-for-achievement also 
may be positively correlated with achievement on multiple-choice 
tests (Russell, 1969). Findings such as these create a problem for 
the tester in that his tests may be said to discriminate unfairly 
against examinees who avoid taking risks, are anxious, and lack a 
strong need to achieve. This follows because examinees such as 
these may not attempt questions that their risk-taking, secure, 
high need-for-achievement friends would answer. It seems apparent 
that the best scoring instruction is one that would produce test 
scores having low correlations with personality characteristics such 
as these, in addition to having high reliability and criterion 
validity. 


² Recent research by Slakter (1969) indicates that the correlation between 
risk taking and test performance may be negative in some instances, although 
it is unclear just what determines whether the relationship will be positive or 
negative. 
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There is some reason for believing that the correlation between 
personality characteristics and test scores is affected by the type of 
scoring instructions used on the test. Sherriffs and Boomer (1954) 
found achievement test scores obtained under penalty instructions 
had a significant negative correlation with scores on the A scale of 
the MMPI whereas the corresponding correlation for test scores 
obtained under the instruction to guess were near zero and non- 
significant. (Persons high on the A scale are said to be introverted 
and anxious, among other things.) 

It may appear from the results of Sherriffs and Boomer that the 
most desirable instruction is the one to guess. Certainly, if exam- 
inees differ in their propensity to take risks, their level of test anx- 
iety, or their need for achievement, and these differences are re- 
flected in their tendency to respond, then telling everyone to guess 
and to answer every question might be expected to eliminate the 
effect of the differences. It should be noted, however, that this solu- 
tion to the problem is achieved at the expense of increased guessing 
and therefore decreased reliability and criterion validity. More- 
over, it seems that simply telling examinees to guess does not, in 
fact, result in everyone answering every question. At least for 
eighth- and ninth-grade students, the instruction to guess has been 
associated with substantial variation in the number of questions 
students omit (Sabers and Feldt, 1968; Traub et al., 1969). Thus, 
even the instruction to guess may not be effective in cancelling the 
effect of personality characteristics on test performance. Additional 
empirical evidence is clearly required to choose which of the reward 
instruction, the penalty instruction, and the instruction to guess is 
most effective in reducing the effect of personality on test per- 
formance. 


Method 


Subjects 


The subjects for the study were 1091 eighth-grade children (549 
girls) from 26 schools in the Oakville, Ontario school district. 
This group comprised all the eighth-grade children in the district 
except those who were absent from school on the day the two tests 
involving different scoring instructions and degrees of speededness 
were being administered. 
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Personality, Intelligence and Achievement Measures 


The description of the measures used in the study and how they 
were administered is presented in two parts. In this part, we de- 
scribe those measures not directly involved in the experimental 
phase of the study, that is, the phase in which the scoring instruc- 
tions and the speededness of the administration were manipulated. 

Risk-taking. This variable was assessed in the context of multiple- 
choice tests using the procedure devised by Swineford (1938). Two 
measures were taken, one of risk-taking on a vocabulary test, the 
other of risk-taking on a mathematics test. The vocabulary measure 
utilized 78 items from Form A of the 90-item Dominion Vocabulary 
Test; the mathematics measure utilized the 50 items of Form 3A 
of the mathematics tests in the STEP series. Both instruments 
were pretested to ensure that the items in each covered a wide 
range of difficulty and that the average item difficulty was close 
to p = .50 so that each eighth-grade student would have at least 
some opportunity for risk-taking as he worked the questions. 

The risk-taking tests were given approximately one month before 
the other tests and questionnaires of the study. Both were admin- 
istered on the same day, the mathematics test before and the vo- 
cabulary test after the luncheon break. The students were carefully 
coached in the Swineford procedure for working the test. Sufficient 
time was available so that every student, no matter how slow, was 
able to attempt every question. Motivation to work these tests, as 
well as the others administered in the study, was provided by in- 
forming the students that their scores would be reported to them 
and their teachers. Examinees recorded their answers to each ques- 
tion and the number of marks claimed for each answer—this latter 
response is required in the Swineford procedure—on machine scor- 
able answer sheets. Risk-taking scores were computed using a 
modification of Swineford’s formula, the modification being nec- 
essary because we employed four-option (mathematics) and five- 
option (vocabulary) multiple-choice items instead of true-false 
items. 

Test anxiety and need-achievement. These variables were mea- 
sured by questionnaire. Anxiety was assessed using a slightly modi- 
fied version of the Test Anxiety Scale for Children (Sarason, 
Davidson, Lighthall, Waite and Ruebush, 1960). (Three items of 
the scale were removed because they had been judged unsuitable 
by school personnel.) Need-achievement was measured using an 
instrument devised by Russell (1969). The questionnaire composed 
of the items from these scales was administered after the students 
finished working the tests given under different scoring instructions 
and degrees of speededness. A test anxiety score and a need achieve- 
ment score were derived for each student by keying his question- 
naire responses in the manner prescribed by the authors of the 
instruments. 

Intelligence and school achievement. School records were entered 
to obtain, if present, the most recent IQ score for each student and 
his final marks in English and mathematics in grade seven. Several 
weeks after the conclusion of the experiment, the eighth-grade 
marks in English and mathematics became available. These were 
recorded for all except a few students who had left school sometime 
after the experiment had been completed. 


Experimental Administration of Two Aptitude Tests 


This section describes the experimental phase of the study. Two 
tests were developed for use in this phase, a vocabulary and a mathe- 
matical reasoning test. Each consisted of two separately timed 
parts. The parts of the vocabulary test each contained 20 synonym 
items whereas the parts of the mathematical reasoning test each 
contained 15 problems. The source of the items was the Kit of 
Reference Tests for Cognitive Factors (French, Ekstrom and Price, 
1963), with the vocabulary items being drawn from the tests mea- 
suring the Verbal Comprehension Factor and the mathematical 
reasoning problems being taken from the tests measuring the General 
Reasoning Factor. Items were selected on the basis of pretest in- 
formation according to the following criteria: Each item in one part 
of the test had to match the corresponding item in the other part 
in terms of difficulty (±.02 was the allowable discrepancy in the 
value of the percentage passing corresponding items); the range 
of difficulty of the items in each part was from p = .10 to p = .90 
with a mean slightly less than p = .50. This way of constructing 
the tests ensured reasonably parallel parts and a range and mean 
level of difficulty such that all students would be expected to en- 
counter some items that could be answered only by guessing. 

The three scoring instructions used in this study were similar to 
the reward, penalty, and “guess” instructions used by Traub et al. 
(1969). Only the guess instruction was substantively different, 
inasmuch as in this study it emphasized answering every question 
in addition to guessing whenever necessary. 

The time limits used for administering the vocabulary and 
mathematical reasoning tests were determined on the basis of pre- 
test results. It appeared that ten minutes per part of the verbal 
test and 20 minutes per part of the mathematical test would be 
sufficient to enable 95 per cent or more of eighth-grade students to 
complete the part. These times were selected for the power condi- 
tion. The speeded condition was created by allowing only four min- 
utes per part of the vocabulary test and ten minutes per part of the 
mathematical reasoning test. These times were chosen in the ex- 
pectation that approximately 50 per cent of subjects would be un- 
able to finish each part. 

Students were assigned to one of the six experimental conditions 
(three scoring instructions x two degrees of speededness) using a 
stratified random sampling procedure. The stratifying variables 
were schools, sex and intelligence. Students within each school 
were divided on the basis of sex. The members of each sex were 
then rank-ordered on the basis of the IQ scores drawn from school 
records. Clusters of six students were taken beginning at the top of 
the rank-ordered list for each sex. Students within each cluster were 
randomly assigned to one of the six experimental conditions. 
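
The assignment procedure just described can be sketched as follows. This is an illustration only: the record layout, function name and seed are hypothetical, and the handling of a final cluster with fewer than six students (each receives a different, randomly chosen condition) is an assumption, since the text does not say how incomplete clusters were treated.

```python
import random
from collections import defaultdict

# Three scoring instructions crossed with two degrees of speededness.
CONDITIONS = [
    (instruction, speed)
    for instruction in ("reward", "penalty", "guess")
    for speed in ("power", "speeded")
]


def assign_conditions(students, seed=0):
    """Assign each student to one of the six conditions.

    `students` is a list of dicts with keys 'id', 'school', 'sex' and 'iq'
    (a hypothetical record layout). Students are stratified by school and
    sex, rank-ordered on IQ within each stratum, taken in clusters of six
    from the top of the list, and randomly assigned within each cluster.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for student in students:
        strata[(student["school"], student["sex"])].append(student)

    assignment = {}
    for members in strata.values():
        ranked = sorted(members, key=lambda s: s["iq"], reverse=True)
        for start in range(0, len(ranked), 6):
            cluster = ranked[start:start + 6]
            # Sampling without replacement gives every member of the
            # cluster a different condition.
            conditions = rng.sample(CONDITIONS, len(cluster))
            for student, condition in zip(cluster, conditions):
                assignment[student["id"]] = condition
    return assignment


# Twelve hypothetical students in a single school-by-sex stratum.
demo_students = [
    {"id": k, "school": "A", "sex": "F", "iq": 90 + 2 * k} for k in range(12)
]
demo_assignment = assign_conditions(demo_students, seed=1)
```

Because each complete cluster of six contributes exactly one student to every condition, the six groups are closely matched on school, sex and IQ.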

The administration of the verbal and mathematical tests was 
carried out in at least two rooms per school. Within each room all 
students worked the tests under the same degree of speededness but 
approximately one-third of the students were assigned to each 
scoring instruction. This latter restriction meant that the scoring 
instructions had to be read silently by the students. Despite the 
limitation inherent in such a procedure, it had been found effective 
in our earlier study (Traub, et al., 1969). The directions about 
speededness were read aloud. Students in the speeded group were 
encouraged to work as quickly as possible whereas students in the 
power condition were informed that they had ample time and did 
not need to hurry. Responses to the test items were recorded on 
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machine scorable answer sheets. Each part of each test was scored 
separately but, except for the reliability analysis, total test scores 
consisting of summed part scores were used in the data analysis. 
Posttest questionnaire. Several questions were asked of the stu- 
dents after they had finished working the tests. The purpose of 
these questions was to find out if the students could recognize the 
instructions they had been given for working the tests. In addition, 
several questions dealt with the effect of the instructions on the 
approach the students adopted in working the tests and whether the 
students regarded the tests as fair measures of their knowledge. 


Results 
Test Performance 

Statistics are reported in Table 1 that summarize, in terms of 
correct, incorrect, omit and formula scores, the performance of the 
12 experimental groups on the vocabulary and mathematical rea- 
soning tests. Each score for each test was analyzed separately 
using a three-factor analysis of variance. The factors were scoring 
instructions, degrees of speededness and sex, each factor being re- 
garded as fixed. Equal cell frequencies for the analyses were 
achieved by discarding cases at random where necessary. The cell 
frequencies were 73 for the analyses involving scores on the vocab- 
ulary test and 84 for the analyses involving scores on the mathe- 
matical reasoning test. (The difference in cell frequencies for the 
two tests was caused by a mistiming of the speeded administration 
of the vocabulary test in one school.) 

Preliminary inspection of the data revealed a significant de- 
parture from homogeneous cell variances for the omit scores of the 
vocabulary test and the correct, omit and formula scores of the 
mathematical reasoning test. This condition was alleviated by ap- 
plying a logarithmic transformation to the correct and formula 
scores of the mathematical reasoning test. However, the same 
transformation was ineffective when applied to the omit scores of 
either test. In all cases, the conclusions supported by the analyses 
involving transformed scores were the same as those supported by 
the analyses of untransformed scores. Hence, we are reporting соп- 
clusions based on the analyses of untransformed data. 

The results of the analyses of variance are summarized in Table 
2 which, for significant main effects, contains the mean for each
level of the main effect, the associated F-ratio, and, in the case of
the main effect for scoring instructions, an indication of which con- 
trasts of ordered means were found to be significant using the New- 
man-Keuls procedure. Several conclusions are clearly supported. 
One is that scoring instructions dramatically affect omissiveness. 
The reward instruction is more effective in producing omissive
behavior than the penalty instruction, a conclusion which holds for
both tests. This fact is also reflected in the low mean number of cor- 
rect and incorrect scores of the reward group, but the difference be- 
tween the reward and penalty groups on these scores typically did 
not achieve statistical significance. As would be expected, the in- 
struction to guess produced the fewest omitted questions and the 
most correct and incorrect answers. However, even with ample 
time, not all students given the instruction to guess answered every 
question (cf. the results in Table 1 for the groups that worked the 
tests under the power condition and the instruction to guess). Also 
clear from Table 2 is the fact that the speededness of the adminis- 
tration was effectively manipulated in the study. Performance was 
poorer when the administrations were speeded than when they were 
unspeeded. The factor of sex was significant only on the mathe- 
matical reasoning test on which the boys outperformed the girls. 
This result is not surprising in view of the number of times it has 
been observed in previous research. 

The hypothesis of additivity of main effects was not rejected in 
any of the analyses involving scores on the mathematical reason- 
ing test. However, a significant two-way interaction between the 
effects for scoring instructions and degrees of speededness was 
found in the analysis of omit scores on the vocabulary test (F = 
6.29; df = 1,864; p < .01). The reason for the interaction is that the 
difference between the number of omitted questions under the 
power and speed conditions for both the reward and penalty in- 
structions was approximately equal and markedly larger than the 
corresponding difference for the instruction to guess. The interac- 
tion was ordinal in the sense that for all three scoring instructions 
more omits were observed under the speeded condition than the 
power condition. Besides this two-way interaction, significant triple 
interactions were observed in the analyses of the correct and
formula scores of the vocabulary test (F > 5.40; df = 2,864; p < .01).

The posttest questionnaire asked the examinees to identify the
instructions they had been given about scoring and the speededness
of the administrations. Those who failed to recognize the instruc- 
tions they had received were culled and the scores of only those 
who correctly recognized the instructions were analyzed separately. 
Hereafter, these groups are referred to as the “reduced experimen- 
tal groups.” The results for the reduced groups were very similar 
to those for the total groups, and, therefore, are not reported. 


Reliability 


Reliability estimates are reported in Table 3 for the correct and 
formula scores achieved by each experimental group on the vocabu- 
lary and mathematical reasoning tests. Two estimates were com- 
puted for each score. The one without brackets is based on the 
scores of the total experimental group whereas the one in brackets 
is based on the data of the reduced group. The reliability estimates 
were derived by computing the correlation between scores on the 
two parts of each test and then adjusting it using the Spearman- 
Brown formula. This type of reliability coefficient was computed 
because it is appropriate for use with speeded tests and it should 
adequately reflect the differences that existed among the experi- 
mental groups in the amount of guessing done on the tests. 
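
As a minimal sketch of the split-half procedure just described (not the
authors' own computation), the following Python fragment correlates two
part scores and steps the result up with the Spearman-Brown formula. The
function names and the part scores are hypothetical, introduced only for
illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_part, factor=2.0):
    """Step a part-test correlation up to a test `factor` times as long."""
    return factor * r_part / (1.0 + (factor - 1.0) * r_part)


def split_half_reliability(part1, part2):
    """Correlate two part scores, then apply the Spearman-Brown correction."""
    return spearman_brown(pearson(part1, part2))


# Hypothetical part scores for six examinees (illustration only).
part1 = [12, 15, 9, 20, 17, 11]
part2 = [10, 16, 8, 19, 15, 13]
print(split_half_reliability(part1, part2))
```

Because the two parts are separately timed, the correlation between them
is not inflated by speed in the way an odd-even split would be, which is
the reason given above for preferring this coefficient.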

The significance of differences among reliability coefficients was 
tested in the following way. The reliability estimates were treated 
as if they were zero-order correlation coefficients and were trans- 
formed using Fisher’s Z transformation. Each transformed co- 
efficient could then be regarded as the estimate of a parameter which 
is normally distributed with a known sampling variance (McNe- 
mar, 1962). This information is all that is required for each cell of 
an analysis of variance design in order to perform the analysis.* 
Consequently, a three-way analysis of variance was performed on 
the coefficients in each column of Table 3. Additional analyses 
were done in which the reliability estimates for the experimental 
groups instructed to guess were based on correct scores whereas the 
estimates for the other experimental groups were based on formula 
scores. These analyses seemed appropriate inasmuch as they com- 
pared the groups in terms of the reliability of the scores they had 
been led to expect from the test instructions. Because there were
unequal cell frequencies, the procedure for an unweighted-means
analysis was applied (Winer, 1962).

* The authors are indebted to D. F. Burrill, who suggested the
analysis and from whom the procedural details are available
(Burrill, 1970).
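
The transformation step described above can be sketched as follows. This
is an illustration under stated assumptions, not the three-way analysis of
variance the authors performed: the sampling variance 1/(N - 3) is the
standard large-sample value for Fisher's Z of a zero-order correlation and
is assumed here to be the "known sampling variance" referred to, and the
pairwise contrast and all function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher's Z transformation of a correlation-like coefficient."""
    return math.atanh(r)


def z_variance(n):
    """Large-sample sampling variance of Fisher's Z based on n cases."""
    return 1.0 / (n - 3)


def normal_deviate(r1, n1, r2, n2):
    """Normal deviate for the difference between two independent,
    Z-transformed coefficients (a simple pairwise contrast)."""
    diff = fisher_z(r1) - fisher_z(r2)
    return diff / math.sqrt(z_variance(n1) + z_variance(n2))


# Hypothetical reliability estimates for two cells of the design.
print(normal_deviate(0.85, 90, 0.78, 88))
```

Each cell of the design thus contributes one transformed coefficient and
one known variance, which is all the analysis of variance requires.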

The analysis of variance results may be summarized as follows: 
For the vocabulary test, the only significant main effect was asso- 
ciated with the factor of speededness in the analysis of reliability 
coefficients for the correct scores of the total experimental groups 
(F = 5.01; df = 1,976; p < .05). Scores for the power condition 
were more reliable than those for the speed condition. In no case 
was the main effect for scoring instructions or sex found to be sig- 
nificant. This fact notwithstanding, it is interesting to note that the 
unweighted mean of the reliability estimates associated with the 
reward and the penalty instruction were, in all analyses, nearly 
equal and larger than the mean of the estimates associated with the 
groups told to guess. 

For the mathematical reasoning test, the main effect for scoring 
instructions was significant in every analysis except two (F > 
3.59; df = 2,1055; p < .05). Nonsignificant F-ratios for scoring
instructions were obtained in the analysis involving the reliability 
coefficients for the formula scores of the total experimental groups, 
and the one involving the reliability coefficients for the correct 
scores of groups told to guess and for the formula scores of the 
other groups, again for the data of the total experimental groups. In 
all analyses, the highest mean reliability was associated with the 
reward instruction and the lowest with the penalty instruction, the 
difference proving significant in all analyses where the main effect 
for scoring instructions was significant. The mean reliability co- 
efficient associated with the reward instruction was found to be sig- 
nificantly higher than the coefficient associated with the instruc- 
tion to guess in all the analyses involving the data of the reduced 
experimental groups; in these same analyses, the mean associated 
with the instruction to guess was significantly higher than the mean 
associated with the penalty instruction. Degree of speededness was 
the other main effect for which significant F-ratios were observed 
in the analysis of reliability coefficients for scores on the mathe- 
matical reasoning tests. The mean reliability for the power condi- 
tion was significantly higher than the corresponding mean for the 
speeded condition (F > 5.57; df = 1,1055; p < .05). 

None of the interaction effects was found to be significant in 
any of the analyses of variance involving reliability coefficients.


Criterion Validity


The results that relate to criterion validity are reported in Table 
4. Multiple correlations were computed between scores on the vo- 
cabulary and mathematical reasoning tests and each of five criteria: 
IQ, and school achievement in English and mathematics in grades 
seven and eight. (In Table 4, all coefficients for the groups told to 
guess are based on correct scores for the vocabulary and mathemat- 
ical reasoning tests; for all other groups, the coefficients are based 
on formula scores for the two tests. Also, the results in Table 4 are 
for the total experimental groups only. The results were not sub- 
stantially different when the validity coefficients were computed 
from the data of the reduced experimental groups.) 

Differences among the experimental groups in terms of the mag- 
nitude of validity coefficients were tested for significance using the 
analysis of variance procedure outlined earlier. In no case was a
main effect associated with scoring instructions, degree of speeded- 
ness or sex found to be significant. And only for the validity co- 
efficients associated with the criterion of achievement in eighth- 
grade English was an interaction term found to be significant; 
degree of speededness interacted with sex. For boys, the best predic- 
tion of eighth-grade English achievement came from tests adminis- 
tered under the power condition whereas for girls, the best prediction 
was provided by tests administered under the speed condition. 

Although the factor of scoring instructions was not observed to 
have a significant effect on criterion validity, it is interesting to 
note that for every one of the five criteria the highest mean va- 
lidity coefficient was associated with the reward instruction. For 
three criteria, IQ and school achievement in English, the lowest 
validities were associated with the instruction to guess. The lowest 
validities for school achievement in mathematics were associated 
with the penalty instruction. 


Correlation with Personality Variables 


Table 4 also contains the multiple correlations between the
vocabulary and mathematical reasoning scores and four personality
measures: risk-taking on a vocabulary test, risk-taking on a math- 
ematics test, need for achievement, and test anxiety. In all cases 
the correlations are small, and the differences that exist across
experimental groups are not large enough to yield significant main
effects or interaction terms in an analysis of variance. 

Table 5 presents the means and standard deviations of scores for 
each sex on the personality measures and an estimate of the reli- 
ability of each measure based on the total sample. These results 
are presented mainly to indicate that the personality variables 
had sufficient variance and were measured reliably enough so that 
the small magnitude of the correlations reported in Table 4 cannot 
be explained in terms of the restricted range or the unreliability of 
the personality measures. 


Questionnaire Responses 


A summary of the responses to some of the posttest questionnaire
items is provided in Table 6. The breakdown of responses is
in terms of either the speededness factor or the scoring-instruction
factor, as these were the dimensions on which substantial intergroup
variability was observed. The differences observed in the
responses made to these questions are what would be expected given
the differences among the instructions for working the tests. Other 
questionnaire items dealt with such things as whether the student 
thought his test score would be a good index of his knowledge of 
vocabulary or his mathematical reasoning ability, and the type of 
test the student preferred. Experimental group differences on these 
questions were not large enough to be interesting. 


TABLE 5
Statistics Summarizing Performance on Four Measures of Personality

                          Risk-Taking
                    Vocabulary   Mathematics   Need for       Test
                                               Achievement    Anxiety

Males     Mean        23.52        38.66         18.37          8.44
          SD          20.95        22.65          4.26          5.76
          N             505          495           530           530

Females   Mean        16.95        26.93         19.03         11.19
          SD          14.51        18.02          3.75          5.77
          N             493          487           541           541

          r_xx         .843 (a)     .666 (a)      .650 (b)      .886 (b)
          N             998          982          1071          1071

(a) Odd-even split-half correlation. (This, then, is a reliability
estimate of one-half the test. It is inappropriate to apply the
Spearman-Brown formula to the correlation since the risk-taking scores
are ratios and the sum of the ratios based on odd- and even-numbered
items does not necessarily equal the ratio based on the total set of
items. This linear relationship is assumed in the derivation of the
Spearman-Brown formula. The reported correlation presumably represents
a lower-bound estimate of reliability and it is likely that the
reliability of the total risk-taking score would be appreciably higher.)

(b) Odd-even split-half correlation corrected using the Spearman-Brown
formula.


Discussion
An objective of this investigation was to extend knowledge about 
the relative effects of two scoring instructions, promising a small 
reward for omitted items and threatening a small penalty for 
wrong answers. The results lend support to the assertion that the 
reward instruction more effectively encourages omissive behavior 
than the penalty instruction. This conclusion was advanced in our 
earlier study (Traub et al., 1969) and is generalized here to a dif- 
ferent content domain (mathematical reasoning) and a slightly 
younger age group (eighth as opposed to ninth grade students). 
Moreover, the reward instruction appears to be most effective in- 
dependently of sex or the degree of speededness of the administra- 
tion. (This latter assertion can be made because the interaction ob- 
served between scoring instructions and degree of speededness in 
the analysis of the omit scores on the vocabulary test was ordinal.) 
The present results also support, but less securely, the assertion 
that the reward instruction yields scores with higher reliability 
and criterion validity than the penalty instruction. The evidence 
bearing on reliability is strongest. At least for scores on the mathe- 
matical reasoning test, the reward instruction was associated with 
significantly higher reliability than the penalty instruction. It is 
unclear why a sizeable difference in reliability for the two instruc- 
tions was not observed for the vocabulary test, especially in view of 
the fact that we found significant differences using a vocabulary 
test in our earlier study (Traub et al., 1969). However, the vo- 
cabulary tests of the two studies did differ in relative difficulty and 
length: In this study, the group told to guess in the power admin- 
istration achieved, on the average, over 50 per cent correct answers 
whereas in our earlier study the group told to guess averaged only 
45 per cent correct answers. The more difficult the test the greater 
the possibility that a student will need to guess. Thus, the more 
difficult the test the more effective a scoring instruction can be in 
controlling guessing behavior, and the greater the effect it can have 
on reliability. Moreover, a long test such as we used in the previous 
study (90 items) would be expected to reveal more clearly than a 
short test any differences in reliability that may be attributed to 
the different scoring instructions.


TABLE 6
Proportion of Students in Each Level of the Speededness or Scoring
Instruction Factors Giving Each Response to Selected Questionnaire Items

                                                 Speededness Factor
                                                 Power       Speed
Question                                       (N = 547)   (N = 544)

1. Did the directions in the tests say anything
   about how quickly you should work?
   (a) Work as quickly as possible                .04         .96
   (b) Take plenty of time to try every
       question                                   .88         .01
   (c) [response label illegible]                 .08         .08
4. What did you decide about how quickly you
   would work through the tests?
   (a) Work as quickly as possible                .18         .79
   (b) Work at usual speed                        .69         .19
   (c) Work slowly and carefully                  .13         .01
6. Was your decision about how quickly you
   would work the tests affected by the directions?
   (a) Yes                                        .35         .56
   (b) [response label illegible]                 .56         .35
   (c) [response label illegible]                 .09         .09

                                            Scoring Instruction Factor
                                          Reward     Penalty     Guess
Question                                (N = 364)  (N = 369)  (N = 358)

[number illegible] Did the directions in the tests say how
   your score would be determined?
   [response labels illegible]              .03        .03        .78
                                            .93        .00        .02
                                            .01        .94        .02
                                            .03        .03        .19
[question illegible]
   [response labels illegible]              .44        .66        .95
                                            .31        .16        .02
                                            .25        .18        .03
5. Was your decision to guess or not to guess
   affected by the directions?
   (a) [response label illegible]           .85        .36        .35
   (b) [response label illegible]           .44        .41        .49
   (c) [response label illegible]           .21        .24        .15
[number illegible] Do you think you can improve your score on a
   multiple-choice test by guessing?
   (a) Yes                                  .25        .27        .44
   (b) No                                   .41        .37        .30
   (c) Do not know                          [values missing]


No conclusive statement based on the results of this study can
be made about the effect of scoring instructions on criterion
validity inasmuch as statistically significant differences were not
obtained. In part this failure to find significance may have been due
to the fact that teacher marks were used for four of the five cri- 
teria. Teacher marks are often suspected to be contaminated by 
factors unrelated to the amount of knowledge a student has. 
Where such factors are operative, the validity of tests for predict- 
ing teacher marks will be attenuated. It would have been desirable 
to have as criterion information scores on free-response tests as well 
as teacher marks and intelligence scores. However, it is noteworthy 
that the reward instruction produced scores with the highest cri- 
terion validity coefficients on all of the criteria investigated. Also 
interesting is the fact that this finding corroborates the results re- 
ported by Sax and Collet (1968). 

An interpretation that may be given to the results of this study is 
that the reward instruction is more effective at controlling guessing 
behavior than the penalty instruction. Additional support for such 
an interpretation is provided by the fact that the instruction to 
guess resulted in significantly fewer omitted questions than the
reward and penalty instructions. Moreover, the instruction to guess
was associated with the lowest estimates of reliability on the
vocabulary test and the lowest correlations with IQ and school
achievement in English. All of these findings may be explained in 
terms of the increased amount of guessing that would be expected 
to occur under the instruction to guess. However, the foregoing in- 
terpretation is not adequate to explain the fact that the instruction 
to guess was associated with higher estimates of reliability on the 
mathematical reasoning test and higher validity coefficients with 
the criterion of school achievement in mathematics than the penalty 
instruction. We are at a loss to explain these anomalous results
which resemble the findings of our previous study (Traub et al., 
1969). However, it should be noted that in neither this study nor 
our earlier one was the instruction to guess effective in getting all 
students to answer all questions. The “deficiency” in the instruction 
to guess may explain in part why it sometimes yields scores with 
higher reliability and validity than would be expected. 

In some respects, the most interesting results of this investigation
were the correlations between the vocabulary and mathematical
reasoning test scores and the four personality measures. The
fact that these correlations were not observed to vary consistently
as a function of scoring instructions calls into question the recom- 
mendation of some writers that the instruction to guess be employed 
as a means of reducing the effect of individual differences in per- 
sonality on test performance (see, for example, Sherriffs and 
Boomer, 1954; Votaw, 1936). 

On balance, the findings of the present study suggest that it is 
preferable to attempt to control guessing through the use of the
reward instruction rather than to attempt to control it using the 
penalty instruction or to encourage it using the instruction to guess. 
This tentative conclusion seems to hold independently of the sex of 
the examinees or the degree of speededness of the test administra- 
tion. Of course, additional research using different tests and differ- 
ent types of students is required to assess the validity of this 
conclusion.


ON BOUNDS FOR THE AVERAGE CORRELATION
BETWEEN SUBTEST SCORES IN IPSATIVELY SCORED 
TESTS 


LEON JAY GLESER¹
The Johns Hopkins University


Let a given psychometric test consist of $n$ subtests. Let the
scores on the various subtests be denoted by $X_1, X_2, \ldots, X_n$. The
test is scored ipsatively if an individual's total score on the test
is always equal to a given constant $c$; that is,

$$X_1 + X_2 + \cdots + X_n = c$$

for each individual.

Ipsative tests are used quite commonly in personality testing, 
particularly in the form of forced-choice personality inventories. 
There is a certain amount of controversy as to whether forced- 
choice inventories and/or ipsatively scored tests provide greater 
validity than other forms of psychometric tests (L. J. Cronbach, 
1960; W. A. Scott, 1968). Part of the difficulty in adjudicating 
counter claims concerning ipsatively scored tests lies in the restric- 
tions that ipsative scoring places upon the correlations between sub- 
test scores. Some of these restrictions have already been mentioned 
in the psychometric literature (H. E. Priest, 1968; J. A. Radcliffe,
1963; W. V. Clemans, 1966).

The present paper is concerned with the effect that ipsative
scoring has upon a commonly used index of between-subtest
correlation, namely the average correlation between subtests:

$$\bar{r} = \frac{1}{n(n-1)} \sum_{i \neq j} r_{ij}, \qquad (1)$$

where $r_{ij}$ is the coefficient of correlation between $X_i$ and $X_j$.
Whereas $\bar{r}$ for nonipsatively scored tests can range between
$-1/(n-1)$ and 1, in the present paper it is shown that the possible
range of values for $\bar{r}$ in ipsatively scored tests is

$$-\frac{1}{n-1} \le \bar{r} \le \frac{n-4}{n}. \qquad (2)$$

Further, it is shown that one can always find (values for the)
correlation coefficients $r_{ij}$ such that the upper bound for $\bar{r}$ is
achieved regardless of what the variances of the subtests $X_i$ happen
to be. The lower bound is always achievable when the subtests have a
common variance. However, if two or more subtests have different
variances, the lower bound need not be attainable by any choice
of correlations $r_{ij}$.

The proof of (2) is contained in the next section. Although the
lower bound $\bar{r} \ge -1/(n-1)$ has been conjectured many times in
the literature, no rigorous proof for the ipsative case seems to have
been previously published. Both the upper and lower bounds in (2)
were conjectured to me by Julian C. Stanley and Marilyn D. Wang
of the Department of Psychology, The Johns Hopkins University,
to whom go my thanks for suggesting this problem and for
encouraging me in my efforts to find a proof.

¹ Research supported by the Office of Naval Research under Contract
NONR 4010(09) awarded to the Department of Statistics, The Johns Hopkins
University. This paper in whole or in part may be reproduced for any purpose
of the United States Government.


Proof of the Inequality

Let us adopt the following notation:

$\sigma_i^2$ = variance of $X_i$, $i = 1, 2, \ldots, n$,
$r_{ij}$ = correlation between $X_i$ and $X_j$, $i \neq j$,
$R$ = correlation matrix of $X_1, X_2, \ldots, X_n$ = $((r_{ij}))$,

where we follow the convention that $r_{ii} = 1$, $i = 1, 2, \ldots, n$.

If the $X_i$'s are not subtests of an ipsatively scored test (i.e., if
$R$ is a general correlation matrix), then the quadratic form $1_n' R 1_n$,
where $1_k$ is the $k \times 1$ column vector consisting of ones (i.e.,
$1_k' = (1, 1, \ldots, 1)$), is non-negative. This assertion follows since
all correlation matrices are positive semi-definite. But

$$0 \le 1_n' R 1_n = \sum_{i=1}^{n} \sum_{j=1}^{n} r_{ij} = n + n(n-1)\bar{r},$$

which establishes that the average correlation for nonipsatively
scored tests is bounded below by $-1/(n-1)$. The lower bound is
achieved when $r_{ij} = -1/(n-1)$, all $i \neq j$. Clearly, since every
$r_{ij} \le 1$, it follows that $\bar{r} \le 1$. The upper bound is achieved when
every $X_i$ correlates perfectly and positively with every $X_j$ ($r_{ij} = 1$,
all $i \neq j$). Thus, for nonipsatively scored tests, the average
correlation between subtests varies within the range $-1/(n-1)$ to 1.

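Now, consider ipsatively scored tests. Let us fix the variances
and covariances of $X_1, X_2, \ldots, X_{n-1}$. Let
$\sigma' = (\sigma_1, \sigma_2, \ldots, \sigma_{n-1})$
be the $1 \times (n-1)$ row vector whose elements are the (respective)
standard deviations of $X_1, X_2, \ldots, X_{n-1}$. Let

$$R_{11} = ((r_{ij})), \quad i, j = 1, 2, \ldots, n-1,$$

be the matrix of the correlations between $X_i$ and $X_j$;
$i, j = 1, 2, \ldots, n-1$. Then

$$R = \begin{pmatrix} R_{11} & y \\ y' & 1 \end{pmatrix},$$

where $y$ is the column vector whose elements are the correlations
between $X_n$ and the $X_i$'s, $i = 1, 2, \ldots, n-1$. That is,
$y' = (y_1, y_2, \ldots, y_{n-1})$, where

$$y_i = \text{correlation between } X_i \text{ and } X_n = r_{in}, \quad i = 1, 2, \ldots, n-1.$$

Once $\sigma$ and $R_{11}$ are given, $y$ and $\sigma_n^2$ are fixed by virtue of the
ipsative relation defining $X_n$ in terms of $X_1, X_2, \ldots, X_{n-1}$. It is not
difficult to verify that (in vector notation):

$$y = -\frac{1}{\sqrt{\sigma' R_{11} \sigma}}\, R_{11}\sigma. \qquad (3)$$

It follows from (1) and (3) that

$$\begin{aligned}
n(n-1)\bar{r} &= 1_n' R 1_n - n \\
&= 1_{n-1}' R_{11} 1_{n-1} + 2\, 1_{n-1}' y - n + 1 \\
&= 1_{n-1}' R_{11} 1_{n-1} - 2\,\frac{1_{n-1}' R_{11} \sigma}{\sqrt{\sigma' R_{11} \sigma}} - n + 1.
\end{aligned} \qquad (4)$$

Completing the square in (4) gives us

$$n(n-1)\bar{r} = \left(1_{n-1} - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right)' R_{11} \left(1_{n-1} - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right) - n. \qquad (5)$$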
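
The relations above lend themselves to a numerical check. The sketch
below is an illustration only, with hypothetical function names and
made-up standard deviations and correlations: it computes the
correlations of the last subtest from the relation for $y$, and then
evaluates the average correlation both directly from its definition and
from the completed-square form.

```python
import math


def last_subtest_correlations(sigma, R11):
    """Return (y, var_n): y[i] = corr(X_i, X_n) and var_n = Var(X_n),
    where X_n = c - (X_1 + ... + X_{n-1})."""
    m = len(sigma)
    R11_sigma = [sum(R11[i][j] * sigma[j] for j in range(m)) for i in range(m)]
    var_n = sum(sigma[i] * R11_sigma[i] for i in range(m))  # sigma' R11 sigma
    root = math.sqrt(var_n)
    return [-v / root for v in R11_sigma], var_n


def rbar_direct(sigma, R11):
    """Average of all n(n-1) off-diagonal correlations among the n subtests."""
    m = len(sigma)
    n = m + 1
    y, _ = last_subtest_correlations(sigma, R11)
    total = sum(R11[i][j] for i in range(m) for j in range(m) if i != j)
    total += 2.0 * sum(y)
    return total / (n * (n - 1))


def rbar_completed_square(sigma, R11):
    """The same average obtained from the completed-square expression."""
    m = len(sigma)
    n = m + 1
    _, var_n = last_subtest_correlations(sigma, R11)
    root = math.sqrt(var_n)
    d = [1.0 - s / root for s in sigma]
    quad = sum(d[i] * R11[i][j] * d[j] for i in range(m) for j in range(m))
    return (quad - n) / (n * (n - 1))


# Hypothetical values for n = 4 subtests (three free, one determined).
sigma = [1.0, 2.0, 3.0]
R11 = [[1.0, 0.2, 0.1],
       [0.2, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
print(rbar_direct(sigma, R11), rbar_completed_square(sigma, R11))
```

The two functions agree to rounding error for any admissible choice of
standard deviations and correlations, which is the content of the
completing-the-square step.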
The argument we used above in proving that $\bar{r}$ is bounded below
by $-1/(n-1)$ in the non-ipsative case can also be applied in the
ipsative case to prove the lower bound $\bar{r} \ge -1/(n-1)$ (cf.
Inequality (2)). However, in the ipsative case it is not clear that values
for the correlations and variances can be found for which the lower
bound in (2) can be attained. It is thus of interest to determine when
(and if) this lower bound is achieved. If $\sigma_i = \tau$, $i = 1, 2, \ldots, n-1$,
then

$$\sigma = \tau 1_{n-1},$$

and from (4),

$$n(n-1)\bar{r} = 1_{n-1}' R_{11} 1_{n-1} - 2\sqrt{1_{n-1}' R_{11} 1_{n-1}} - n + 1 = \left(\sqrt{1_{n-1}' R_{11} 1_{n-1}} - 1\right)^2 - n,$$

so that $\bar{r} = -1/(n-1)$ if and only if $1_{n-1}' R_{11} 1_{n-1}$ equals 1. However,
since $\sigma_n^2 = \sigma' R_{11} \sigma$, it follows that when $\sigma = \tau 1_{n-1}$, the condition
$1_{n-1}' R_{11} 1_{n-1} = 1$ holds if and only if $\sigma_n = \tau$. Consequently, if all but
one of the variances $\sigma_i^2$ are known to be equal, then the lower bound
in (2) can be achieved; and when the bound is achieved, all $n$ of the
variances $\sigma_i^2$ are equal.

For the case $n = 3$, the condition $1_2' R_{11} 1_2 = 1$ can be met only
when $r_{12} = -\tfrac{1}{2}$. However, when $n \ge 4$, the condition
$1_{n-1}' R_{11} 1_{n-1} = 1$ can be met in many ways. For example, when
$n = 4$, the condition can be met by $r_{ij} = -\tfrac{1}{3}$; $i, j = 1, 2, 3$,
$i \neq j$ (in which case $r_{i4} = -\tfrac{1}{3}$, $i = 1, 2, 3$), or by
$r_{12} = r_{13} = -\tfrac{1}{2}$, $r_{23} = 0$ (in which case
$r_{24} = r_{34} = -\tfrac{1}{2}$, $r_{14} = 0$). This situation is in contrast to
the nonipsative case, where the lower bound is achieved if and only if
all of the $r_{ij} = -1/(n-1)$, $i \neq j$.
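
As a worked check of the second $n = 4$ example, the steps can be written
out in full. This is a sketch under two assumptions: the index labelling
$r_{12} = r_{13} = -\tfrac{1}{2}$, $r_{23} = 0$, and a common standard
deviation $\tau$ for the first three subtests. It uses only the relation
for $y$ and the definition of $\bar{r}$.

```latex
\begin{align*}
R_{11} &= \begin{pmatrix} 1 & -\tfrac12 & -\tfrac12 \\ -\tfrac12 & 1 & 0 \\ -\tfrac12 & 0 & 1 \end{pmatrix},
\qquad 1_3' R_{11} 1_3 = 3 + 2\left(-\tfrac12 - \tfrac12 + 0\right) = 1, \\
\sigma_4^2 &= \sigma' R_{11} \sigma = \tau^2 \, 1_3' R_{11} 1_3 = \tau^2, \\
y &= -\frac{R_{11}\sigma}{\sqrt{\sigma' R_{11} \sigma}} = -R_{11} 1_3
   = \left(0,\ -\tfrac12,\ -\tfrac12\right)', \\
n(n-1)\,\bar{r} &= 2\,(r_{12} + r_{13} + r_{23}) + 2\,(y_1 + y_2 + y_3)
   = 2(-1) + 2(-1) = -4, \\
\bar{r} &= -\tfrac{4}{12} = -\tfrac13 = -\frac{1}{n-1}.
\end{align*}
```

The fourth variance equals the common variance of the other three, as the
preceding argument requires whenever the lower bound is attained.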

Let us next consider the case when at least two of the variances
$\sigma_i^2$, $i \le n-1$, are unequal. From (5), we see that the lower bound
in (2) is achieved if and only if

$$\left(1_{n-1} - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right)' R_{11} \left(1_{n-1} - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right) = 0. \qquad (6)$$

If $R_{11}$ is positive definite, then (6) holds if and only if

$$1_{n-1} = \frac{1}{\sqrt{\sigma' R_{11} \sigma}}\, \sigma.$$

The above equality cannot hold since the elements of the vector on
the left-hand side are equal while, by the given, at least two
elements of the vector on the right-hand side must be unequal.
Although it is possible that (6) is true when $R_{11}$ is singular and $n > 3$,
we here illustrate that (6) is false when $R_{11}$ is singular and $n = 3$
(the case $n = 2$ is trivial). Only two cases of singularity are
possible for $R_{11}$ when $n = 3$, namely

$$R_{11} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \quad \text{and} \quad R_{11} = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}.$$

In these cases, remembering that $\sigma' = (\sigma_1, \sigma_2)$ and that
$\sigma_1 \neq \sigma_2$, we obtain

$$\left(1_2 - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right)' R_{11} \left(1_2 - \frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}}\right) = 1,$$

so that (6) fails. This argument shows that when three variances are
unequal, the lower bound (2) cannot be achieved.

An achievable lower bound is difficult to obtain. In the case $n = 3$,
minimizing (4) with respect to $R_{11}$ (actually with respect to $r_{12}$, the
one free element in $R_{11}$) requires solution of a cubic equation in $r_{12}$.
Cases in which $n > 3$ are presumably even more difficult to solve.

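We turn now to a proof of the upper bound in (2). First note that

$$\sigma' \left(1_{n-1} 1_{n-1}' - R_{11}\right) \sigma = \sum_{i \neq j} \sigma_i \sigma_j (1 - r_{ij}) \ge 0 \qquad (7)$$

since $\sigma_i > 0$, all $i$, and $r_{ij} \le 1$, all $i \neq j$. From (7) it follows that

$$\frac{\sigma' 1_{n-1}}{\sqrt{\sigma' R_{11} \sigma}} \cdot \frac{1_{n-1}' \sigma}{\sqrt{\sigma' R_{11} \sigma}} \ge \frac{\sigma' R_{11} \sigma}{\sigma' R_{11} \sigma} = 1,$$

or, since $\sigma' 1_{n-1} = \sum_{i=1}^{n-1} \sigma_i > 0$,

$$\frac{\sigma' 1_{n-1}}{\sqrt{\sigma' R_{11} \sigma}} \ge 1. \qquad (8)$$

Now let $e_i$ be the $(n-1) \times 1$ vector having $i$th component 1, and
all other components 0. Note that

$$\frac{\sigma}{\sqrt{\sigma' R_{11} \sigma}} = \sum_{i=1}^{n-1} \frac{\sigma_i}{\sqrt{\sigma' R_{11} \sigma}}\, e_i. \qquad (9)$$

From (4) and (9), we have

$$\begin{aligned}
n(n-1)\bar{r} &= 1_{n-1}' R_{11} 1_{n-1} - 2 \sum_{i=1}^{n-1} \frac{\sigma_i}{\sqrt{\sigma' R_{11} \sigma}}\, e_i' R_{11} 1_{n-1} - n + 1 \\
&\le 1_{n-1}' R_{11} 1_{n-1} - 2 \sum_{i=1}^{n-1} \frac{\sigma_i}{\sqrt{\sigma' R_{11} \sigma}} \left(\min_j e_j' R_{11} 1_{n-1}\right) - n + 1 \\
&= 1_{n-1}' R_{11} 1_{n-1} - 2\, \frac{\sigma' 1_{n-1}}{\sqrt{\sigma' R_{11} \sigma}} \left(\min_j e_j' R_{11} 1_{n-1}\right) - n + 1.
\end{aligned} \qquad (10)$$

From (10) and (8) we can conclude that

$$\begin{aligned}
n(n-1)\bar{r} &\le 1_{n-1}' R_{11} 1_{n-1} - 2 \min_j e_j' R_{11} 1_{n-1} - n + 1 \\
&= \max_j\, (1_{n-1} - e_j)' R_{11} (1_{n-1} - e_j) - n.
\end{aligned} \qquad (11)$$

Now let $R_i$ be the $(n-2) \times (n-2)$ correlation matrix obtained
from $R_{11}$ by striking out the $i$th row and $i$th column. It follows
from (11) that

$$\begin{aligned}
n(n-1)\bar{r} &\le \max_i\, (1_{n-1} - e_i)' R_{11} (1_{n-1} - e_i) - n \\
&= \max_i\, 1_{n-2}' R_i 1_{n-2} - n.
\end{aligned} \qquad (12)$$

But for any $k \times k$ correlation matrix $P = ((p_{ij}))$,

$$1_k' P 1_k = \sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij} \le k^2, \qquad (13)$$

since $p_{ij} \le 1$, all $i, j$. Thus

$$n(n-1)\bar{r} \le (n-2)^2 - n = n^2 - 5n + 4 = (n-1)(n-4),$$

or

$$\bar{r} \le \frac{n-4}{n}, \qquad (14)$$

which is the bound we wished to establish.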

We note that regardless of the values of the variances
$\sigma_1^2, \sigma_2^2, \ldots, \sigma_{n-1}^2$, the upper bound (14) is achieved
whenever $r_{ij} = 1$; $i, j = 1, 2, \ldots, n-1$. However, if such is the case,
it follows from (3) that $r_{in} = -1$, $i \neq n$. Since the choice of indices
is arbitrary, it follows that the upper bound (14) is achieved whenever
$n - 1$ of the subtest scores are perfectly and positively correlated
(have correlations 1), in which case the remaining subtest score is
perfectly and negatively correlated with the other test scores. That
this case should provide us with the upper bound to $\bar{r}$ is intuitively
obvious (and motivated the conjecture), but, as we have seen, the fact
that the average correlation in the case when $n - 1$ subtest scores are
perfectly and positively correlated is the maximum possible such
average correlation requires a proof that is not trivial.
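
The two bounds can be illustrated numerically. The sketch below is an
illustration only, with hypothetical function names and simulated scores:
it forces simulated subtest scores to be ipsative by defining the last
subtest as the constant minus the sum of the others, computes the average
correlation, and evaluates exactly the average for the extremal
configuration described above.

```python
import math
import random
from fractions import Fraction


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def make_ipsative(columns, c=100.0):
    """Append a final subtest so that every examinee's scores sum to c."""
    last = [c - sum(row) for row in zip(*columns)]
    return columns + [last]


def average_correlation(columns):
    """Average of the n(n-1) off-diagonal correlations among subtests."""
    n = len(columns)
    total = sum(pearson(columns[i], columns[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def extremal_average(n):
    """Exact average when n-1 subtests correlate +1 with one another and
    the remaining subtest correlates -1 with each of them."""
    total = (n - 1) * (n - 2) - 2 * (n - 1)
    return Fraction(total, n * (n - 1))


# Simulated scores: four free subtests with unequal spreads, one determined.
rng = random.Random(0)
free = [[rng.gauss(0.0, 1.0 + i) for _ in range(200)] for i in range(4)]
scores = make_ipsative(free)
n = len(scores)
print(average_correlation(scores), -1.0 / (n - 1), (n - 4) / n)
```

For every $n$ the extremal configuration gives exactly $(n-4)/n$, while
the simulated average falls between the two bounds.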


Summary 

In the present note, it has been shown that the average correlation
$\bar{r}$ between subtests of an ipsatively scored test is bounded below
by $-1/(n-1)$ and above by $(n-4)/n$, where $n$ is the number of
subtests. Further, it has been shown that one can always find (values
of the) correlation coefficients $r_{ij}$ such that the upper bound $(n-4)/n$
is achieved. The lower bound is always achievable when all but one
of the subtests have a common variance. However, if the subtests
have different variances, the lower bound need not be attainable.


A NOTE ON THE RESTRICTION OF RANGE FOR
PEARSON PRODUCT-MOMENT CORRELATION 
COEFFICIENTS 


LAWRENCE JAMES HUBERT 
University of Wisconsin 


Two recent papers (Stanley and Wang, 1969; Glass and
Collins, 1970) have offered derivations for the limits on the
range of a Pearson product-moment correlation coefficient $r_{12}$
given two other correlations $r_{13}$ and $r_{23}$. The present theoretical
note is merely an attempt to point out a much more general
result that is an immediate consequence of a well-known
inequality. A short and somewhat nonstandard proof of the basic
theorem is presented in an appendix.

Suppose $[X_1, \ldots, X_n]$ is some set of $n$ linearly independent
random variables and $Y$ a single random variable. Let $A$ denote the
variance-covariance matrix for $[X_1, \ldots, X_n]$ and $G$ a column
vector containing hypothesized covariances of $Y$ with
$X_1, \ldots, X_n$. Matrices $A$ and $G$ are assumed to exist.


Theorem: The partitioned matrix

$$S = \begin{bmatrix} A & G \\ G' & \operatorname{Var}(Y) \end{bmatrix}$$

is non-negative definite if and only if
$G'A^{-1}G \le \operatorname{Var}(Y)$.

If $A^{-1}$ does not exist, then at least one of the random variables
$X_1, \ldots, X_n$ is a linear combination of the remaining $n - 1$
random variables. Suppose $X_1$ can be defined as a linear combination
of $X_2, \ldots, X_n$, but the variances and covariances for
$X_2, \ldots, X_n$ form an invertible matrix $A'$. The theorem then
applies to $A'$ and gives the inequality in (1).


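$$\begin{bmatrix} \operatorname{Cov}(X_2, Y) & \cdots & \operatorname{Cov}(X_n, Y) \end{bmatrix}
(A')^{-1}
\begin{bmatrix} \operatorname{Cov}(X_2, Y) \\ \vdots \\ \operatorname{Cov}(X_n, Y) \end{bmatrix}
\le \operatorname{Var}(Y) \tag{1}$$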


With the inequality in (1) and the linear relationship for
$\operatorname{Cov}(X_1, Y)$ given below,

$$\operatorname{Cov}(X_1, Y) = \sum_{i=2}^{n} c_i \operatorname{Cov}(X_i, Y)
\quad \text{where} \quad \sum_{i=2}^{n} c_i X_i = X_1,$$

an admissible region for $G$ is again obtained. This general procedure
is easily extended to the case where more than one variable must be
deleted to obtain an invertible variance-covariance matrix for the
remaining variables.

Suppose $\operatorname{Var}(X_i) = \operatorname{Var}(Y) = 1$ and
$E(X_i) = E(Y) = 0$ for $1 \le i \le n$. The matrix $A$ is then the
intercorrelation matrix for the variables $X_1, \ldots, X_n$, and $G$
is a hypothesized column vector of correlations of $Y$ with
$X_1, \ldots, X_n$. If $A$ is nonsingular, the inequality given in the
theorem takes the simpler form: $G'A^{-1}G \le 1$. If $A$ is considered
as a “known” matrix, the inequality provides an admissible elliptical
region for $G$ in $n$ dimensions.

For an example the same bounds given by Stanley and Wang for the case
$n = 2$ will be derived. Suppose $X_1$ and $X_2$ are two random
variables with correlation matrix $A$, $X_3$ some third random
variable, and $G$ the column vector of correlations between $X_3$ and
the other two random variables:

$$A = \begin{bmatrix} 1.0 & r_{12} \\ r_{12} & 1.0 \end{bmatrix}, \qquad
G = \begin{bmatrix} r_{13} \\ r_{23} \end{bmatrix}$$

If $r_{12} \neq \pm 1.0$, the inequality in the theorem can be written
as in (2).

$$\begin{bmatrix} r_{13} & r_{23} \end{bmatrix}
\begin{bmatrix} 1.0 & r_{12} \\ r_{12} & 1.0 \end{bmatrix}^{-1}
\begin{bmatrix} r_{13} \\ r_{23} \end{bmatrix} \le 1 \tag{2}$$

After performing the inversion and the necessary matrix multiplication,
the inequality in (3) is obtained.

$$r_{12}^2 + r_{13}^2 + r_{23}^2 - 2 r_{12} r_{13} r_{23} - 1 \le 0 \tag{3}$$

If $r_{12} = 1.0$, (3) still provides a valid inequality, i.e.,
$r_{23} = r_{13}$ is obtained. Similarly, $r_{12} = -1.0$ implies that
$r_{23} = -r_{13}$.


The limits on $r_{12}$ in terms of $r_{13}$ and $r_{23}$ can be derived
by factoring the left-hand term of the expression in (3):

$$\Bigl(r_{12} - r_{13} r_{23} - \bigl((1 - r_{13}^2)(1 - r_{23}^2)\bigr)^{1/2}\Bigr)
\times \Bigl(r_{12} - r_{13} r_{23} + \bigl((1 - r_{13}^2)(1 - r_{23}^2)\bigr)^{1/2}\Bigr) \le 0 \tag{4}$$

Finally, the inequality in (4) holds if and only if the two terms have
opposite signs or are zero. This leads to the limits on $r_{12}$ given
in (5).

$$r_{13} r_{23} - \bigl((1 - r_{13}^2)(1 - r_{23}^2)\bigr)^{1/2} \le r_{12}
\le r_{13} r_{23} + \bigl((1 - r_{13}^2)(1 - r_{23}^2)\bigr)^{1/2} \tag{5}$$
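
The limits on $r_{12}$ can be sketched in a few lines of code. The example below computes both limits for given $r_{13}$ and $r_{23}$ and confirms that the determinant of the 3 x 3 correlation matrix vanishes at each limit, is positive between them, and is negative outside them. The function names and the values .6 and .8 are illustrative assumptions, not part of the original note.

```python
import math


def r12_limits(r13, r23):
    """Lower and upper limits on r12 for given r13 and r23."""
    half_width = math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
    return r13 * r23 - half_width, r13 * r23 + half_width


def det3(r12, r13, r23):
    """Determinant of the 3 x 3 correlation matrix; it is non-negative
    exactly when r12 lies between the two limits."""
    return 1 + 2 * r12 * r13 * r23 - r12 ** 2 - r13 ** 2 - r23 ** 2


low, high = r12_limits(0.6, 0.8)
print(low, high)                                   # the two limits
print(det3(low, 0.6, 0.8), det3(high, 0.6, 0.8))   # both close to zero
print(det3(0.5, 0.6, 0.8), det3(-0.1, 0.6, 0.8))   # inside, then outside
```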


If the problem is modified so that $r_{12}$ is assumed to be the only
"known" quantity, the theorem provides a complete elliptical region
for $r_{13}$ and $r_{23}$. Any point within the ellipse is a possible
pair of correlations corresponding to $r_{13}$ and $r_{23}$. The
ellipse intersects both the “$r_{23}$ axis" and the “$r_{13}$ axis" at
the points $\pm(1 - r_{12}^2)^{1/2}$. If $r_{12} = 1.0$, the ellipse
degenerates to the line $r_{13} = r_{23}$. Similarly, $r_{12} = -1.0$
gives the line $r_{13} = -r_{23}$.


APPENDIX

Proof of Theorem: (Necessity). Since there are no linear relationships
among the random variables $X_1, \ldots, X_n$, the matrix $A$ must be
positive definite. For $1 \le i \le n + 1$, define the subdeterminants
$d_i$ of $S$ as follows:

$$d_{n+1} = \det(S)$$

$$d_i = \det \begin{bmatrix}
\operatorname{Var}(X_1) & \cdots & \operatorname{Cov}(X_1, X_i) \\
\vdots & & \vdots \\
\operatorname{Cov}(X_i, X_1) & \cdots & \operatorname{Var}(X_i)
\end{bmatrix}, \quad 1 \le i \le n$$

If $Y$ is not a linear function of $X_1, \ldots, X_n$, then the matrix
$S$ is positive definite. A necessary and sufficient condition for $S$
to be positive definite is for $d_i$ to be greater than zero for all
$i$, $1 \le i \le n + 1$. By using a simple relation for the
determinant of a partitioned matrix, the equality in (6) is obtained
(Rao, 1965, p. 28).

$$\det(S) = \det(A)\,\bigl(\operatorname{Var}(Y) - G'A^{-1}G\bigr) \tag{6}$$

Since $\det(A) = d_n > 0$ and $\det(S) > 0$, a strict inequality is
implied: $\operatorname{Var}(Y) > G'A^{-1}G$. Now, if $Y$ is a linear
function of $X_1, \ldots, X_n$, then $\det(S) = 0$ and an equality is
implied: $\operatorname{Var}(Y) = G'A^{-1}G$.

(Sufficiency). The inequality $\operatorname{Var}(Y) \ge G'A^{-1}G$ and
the positive definite matrix $A$ guarantee $d_i > 0$ for
$1 \le i \le n$, and $d_{n+1} \ge 0$. Since $S$ is non-negative
definite if and only if $d_i \ge 0$, $1 \le i \le n + 1$, the
implication follows.
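
The partitioned-determinant relation used in the proof is easy to check numerically. The sketch below does so for $n = 2$ with made-up values of $A$, $G$, and $\operatorname{Var}(Y)$; all names and numbers are illustrative assumptions.

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det3(m):
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def quadratic_form_inverse(A, G):
    """G' A^{-1} G for a 2 x 2 matrix A and a length-2 vector G."""
    d = det2(A)
    inv = [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
    return sum(G[i] * inv[i][j] * G[j] for i in range(2) for j in range(2))


# Hypothetical covariance matrix of (X1, X2), covariances with Y, and Var(Y).
A = [[4.0, 1.0], [1.0, 9.0]]
G = [1.5, -2.0]
var_y = 2.0

# Assemble the partitioned matrix S and compare both sides of the relation.
S = [A[0] + [G[0]], A[1] + [G[1]], G + [var_y]]
lhs = det3(S)
rhs = det2(A) * (var_y - quadratic_form_inverse(A, G))
print(lhs, rhs)
```

Here $\operatorname{Var}(Y)$ exceeds $G'A^{-1}G$, so both sides are positive and $S$ is positive definite.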


A NOTE ON COMPARING r-BISERIAL AND r-POINT
BISERIAL


JOHN BOWERS*
University of Illinois 


High correlation has been reported (Engelhart, 1965; Aleamoni
and Spencer, 1969) between r-point biserial and r-biserial item
selection indices; the implication is that use of either index in
typical item tryout and selection studies results in the same subset
of selected items. Using a quite simple criterion density function,
one can show that there are conditions under which the two indices
do not lead to the selection of the same item subsets. The maximum
values of both r-point biserial and r-biserial = r-point biserial
$\sqrt{PQ}/h$ are dependent on item difficulty when the item
validating criterion is non-normal and especially so when it departs
from symmetry, as shown by Adams (1960).

A criterion density function defined over the range of $X$ from
0 to 1:

$$Y = (k + 1)X^k \tag{1}$$

becomes more negatively skewed as $k$ ($k \ge 0$) increases. The
distribution is rectangular when $k = 0$, triangular and negatively
skewed when $k = 1$, and so on.

The mean and standard deviation of expression (1) are, respectively:

$$\bar{Y} = \frac{k + 1}{k + 2} \tag{2}$$

and

$$s_Y = \sqrt{\frac{k + 1}{(k + 2)^2 (k + 3)}} \tag{3}$$


* Now with the American Institutes for Research.

The maximum point biserial correlation between a criterion (or
total test score) and an item passed by a proportion $P$ and
failed by a proportion $Q$ of a respondent group is attained
when the criterion distribution is split into identical proportions
$P$ above and $Q$ below a cutting point $Y_c$, where all respondents
above $Y_c$ pass the item. $Y_c$ is found by integrating and solving
for $Y_c$,

$$\int_0^{Y_c} (k + 1)X^k \, dX = Q \tag{4}$$

or

$$Y_c = Q^{1/(k+1)} \tag{5}$$

Since the mean criterion scores for the pass and fail groups are,
respectively,

$$\bar{Y}_p = \frac{k + 1}{k + 2} \cdot \frac{1 - Q^{(k+2)/(k+1)}}{P} \tag{6}$$

and

$$\bar{Y}_q = \frac{k + 1}{k + 2}\, Q^{1/(k+1)} \tag{7}$$

the maximum mean criterion score difference between the pass
and fail groups is

$$\bar{Y}_p - \bar{Y}_q = \frac{k + 1}{k + 2} \cdot \frac{1 - Q^{1/(k+1)}}{P} \tag{8}$$

and the maximum item-criterion point biserial correlation,
$r_{pb}^{*}$, is

$$r_{pb}^{*} = \sqrt{(k + 1)(k + 3)\,\frac{Q}{P}}\;\bigl(1 - Q^{1/(k+1)}\bigr) \tag{9}$$

For a rectangular criterion distribution (e.g., a distribution of
ranks $1, 2, \ldots, n$, as $n$ becomes large), $k = 0$, so

$$r_{pb}^{*} = \sqrt{3PQ} \tag{10}$$

as shown by Stanley (1968, p. 250).
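
The maximum point biserial for this family of criterion densities can be computed directly. The sketch below assumes the closed form $\sqrt{(k+1)(k+3)\,Q/P}\,(1 - Q^{1/(k+1)})$, which follows from the mean, standard deviation, and cutting point given above and which reproduces the entries of Table 1; the function name is illustrative.

```python
import math


def max_point_biserial(p, k):
    """Maximum point biserial for the criterion density (k + 1) X**k on [0, 1],
    for an item passed by proportion p and failed by proportion 1 - p."""
    q = 1.0 - p
    return math.sqrt((k + 1) * (k + 3) * q / p) * (1.0 - q ** (1.0 / (k + 1)))


for k in (0, 1, 2):
    # Three item difficulties per criterion distribution.
    print(k, [round(max_point_biserial(p, k), 3) for p in (0.95, 0.50, 0.05)])
```

For $k = 0$ the function reduces to $\sqrt{3PQ}$, the rectangular case.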
The ratio of $r_{pb}$ to $r_{pb}^{*}$ is

$$\frac{(\bar{Y}_p - \bar{Y}_q)\sqrt{PQ}}{s_Y} \ \text{divided by expression (9)} \tag{11}$$

which, for a rectangular criterion distribution, equals

$$2(\bar{Y}_p - \bar{Y}_q) \tag{12}$$

in agreement with Stanley’s (1968, p. 251) Formula 3 and
Glass’s (1966, p. 626) Formula 1. Their value $n$ is equal to unity
in this case, since $X$ ranges from 0 to 1.

Glass (1966) developed his rank biserial correlation assuming
that the underlying distribution of the dichotomous variable
was the same as the variable composed of $n$ consecutive untied
ranks. Stanley (1968) showed that the ratio of $r_{pb}$ (ranks)
to $r_{pb}^{*}$ (ranks) is equal to Glass's rank biserial, analogous
to Clemans' (1958) demonstration that for a normally distributed
criterion and an underlying normally distributed dichotomous
X-variable,

$$r_b = r_{pb} / r_{pb}^{*} \tag{13}$$

where

$$r_{pb}^{*} = h / \sqrt{PQ}$$

and

$h$ = the unit normal distribution ordinate at the cutting point.


We generalize this result for any criterion distributed as
expression (1) and define a biserial correlation $r_b(k)$, where the
underlying dichotomous X-distribution is identical to that of the
Y-variable: $r_b(k)$ is the ratio of $r_{pb}(k)$ to $r_{pb}^{*}(k)$.

Stanley (1968) points out that the blowup factor $h/\sqrt{PQ}$ “will
tend to overcorrect if the Y-score distribution is more nearly
rectangular than normal, and undercorrect if the Y-score distribution
is more concentrated in the center than the normal distribution is”
(p. 252). In other words, if the criterion is not normally
distributed, customary calculation of $r_b$ via $r_{pb}$ is a mistake
because the wrong $r_{pb}^{*}$ is applied. One result is the obtaining
of $r_b$'s that exceed unity. The maximum biserial correlation of an
item with a criterion exceeds unity over the range of item P-values
for which the blowup factor $r_{pb}^{*}$ for a non-normally distributed
criterion is larger than $h/\sqrt{PQ}$. This has been discussed by
Brogden (1946).

Values of $r_{pb}^{*}$ and $r_b^{*}$ are shown in Table 1 for a
normally distributed criterion and for criteria distributed as
expression (1) with $k$ = 0, 1, and 2. Both $r_{pb}^{*}$ and $r_b^{*}$
are influenced by the degree and direction of skew in the criterion,
shown also by Adams (1960).

Typically, $r_b$ and $r_{pb}$ are applied in item tryouts by a
researcher wishing to choose a subset of “best” items, those with
either highest $r_b$ or highest $r_{pb}$. When the selection criterion
is symmetric, it apparently makes little difference whether $r_b$ or
$r_{pb}$ is used (Engelhart, 1965; Aleamoni and Spencer, 1969).

If, however, there is serious departure from symmetry in the item
selection criterion, the choice of $r_{pb}$ or
$r_b = r_{pb}\sqrt{PQ}/h$ can result in the selection of two different
subsets of “best” items, since the function $r_{pb}$ and the function
$r_b$ are not maximized at the same P-value (see Table 1). $r_{pb}$
maximizes at more moderate item P-values than does $r_b$.
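
The point can be illustrated for the triangular criterion ($k = 1$). The sketch below assumes the closed form for the maximum point biserial that reproduces Table 1, multiplies it by $\sqrt{PQ}/h$ to obtain the maximum biserial, and reports which of the tabled P-values maximizes each index. Function names are illustrative; `statistics.NormalDist` supplies the unit normal ordinate $h$.

```python
import math
from statistics import NormalDist


def max_point_biserial(p, k):
    """Maximum point biserial for the criterion density (k + 1) X**k on [0, 1]."""
    q = 1.0 - p
    return math.sqrt((k + 1) * (k + 3) * q / p) * (1.0 - q ** (1.0 / (k + 1)))


def max_biserial(p, k):
    """Maximum point biserial multiplied by sqrt(PQ)/h, where h is the unit
    normal ordinate at the point cutting off proportion p above it."""
    z = NormalDist().inv_cdf(1.0 - p)
    h = NormalDist().pdf(z)
    return max_point_biserial(p, k) * math.sqrt(p * (1.0 - p)) / h


p_values = [0.95, 0.90, 0.80, 0.75, 0.60, 0.50, 0.40, 0.25, 0.20, 0.10, 0.05]
best_pb = max(p_values, key=lambda p: max_point_biserial(p, 1))
best_b = max(p_values, key=lambda p: max_biserial(p, 1))
print(best_pb, best_b)   # P-value maximizing each index over the tabled grid
```

Over this grid the point biserial peaks at a more moderate P-value than the biserial, so ranking items by one index or the other can favor different items.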

A comparison of r-biserial and r-point biserial item indices
really begs the question: which index identifies items in a total
pool that are better than the rest in the sense that they produce
higher internal consistency within the sample studied? It may
be sounder to select items that lead to test score distributions
that are best for a particular application. We should heed
Richardson's (1936) warning against "letting difficulty take care of
itself."

Item statistics are important insofar as they permit us to
gauge the characteristics of a test composed of the items examined.
Tests are important insofar as they lead to good decisions. If,
for example, a test is needed to reject the bottom five per cent


TABLE 1
Maximum r_pb and r_b = r_pb sqrt(PQ)/h for Items at Varying Difficulty
and Various Criterion Distributions


Item Difficulty     Normal           k = 0            k = 1            k = 2
  P      Q       r_pb*   r_b*    r_pb*   r_b*    r_pb*   r_b*    r_pb*   r_b*
 .95    .05      .473    1.0     .378    .798    .504   1.065    .561   1.186
 .90    .10      .585    1.0     .520    .888    .645   1.102    .692   1.182
 .80    .20      .700    1.0     .693    .990    .782   1.117    .804   1.149
 .75    .25      .734    1.0     .750   1.022    .816   1.113    .827   1.128
 .60    .40      .789    1.0     .849   1.076    .849   1.076    .832   1.055
 .50    .50      .798    1.0     .866   1.085    .828   1.038    .799   1.001
 .40    .60      .789    1.0     .849   1.076    .781    .990    .743    .942
 .25    .75      .734    1.0     .750   1.022    .656    .895    .613    .836
 .20    .80      .700    1.0     .693    .990    .597    .853    .555    .794
 .10    .90      .585    1.0     .520    .888    .436    .744    .401    .685
 .05    .95      .473    1.0     .378    .798    .312    .660    .287    .606

of a population for remediation, then its items should be such
that they are passed half the time by respondents at the critical
ability point. In this case, many easy items would be assembled
into a tryout form, and a negatively skewed total score distribution
would result in the population. One might expect that if


would $r_{pb}$.


COMPUTER PROGRAMS


This section is provided for the early publication at the expense 
of the author of computer programs relevant to measurement in 
the fields of education and psychology. Customarily, a program 
should be expected not to exceed six or eight printed pages. Manu- 
scripts of four or fewer printed pages are preferred. Each manu- 
script will be carefully reviewed as to its suitability and accuracy 
of content. In some instances an accepted paper may be returned 
to the author for possible revisions or shortening. The cost to the 
author will be forty-five dollars per page plus ten dollars extra 
per page for tables, figures, and formulas. 

Authors are granted permission to have reprints made of their 
articles at their own expense. 

Manuscripts received up to November first will be considered 
for the Spring issue; manuscripts received between then and May 
first will be considered for the Autumn issue. 

All correspondence and duplicate manuscripts should be directed 
to: 


Dr. William B. Michael 
325 Callita Place 
San Marino, California 91108. 


COMPUTER PROGRAMS FOR THE
SEMANTIC DIFFERENTIAL


E. D. LAWSON, GEORGE H. GOLDEN, JR., AND
KATHY JELONEK CHMURA


State University College, Fredonia, N. Y.


The Osgood semantic differential (SD) (Osgood, Suci, and
Tannenbaum, 1957; Snider and Osgood, 1969) has had a great many
useful applications in psychology. A computer can assist in making
the SD available for even greater application, provided one has
access to suitable programs. Five programs for use with the SD have
been written and two others modified for it.

The programs can perform the computations for: (1) means and 
standard deviations on subscales, (2) means and standard devia- 
tions on Evaluation, Potency, and Activity factors (EPA Scores), 
(3) Osgood D values for all concepts, (4) the correlation between 
distance measures obtained from EPA scores and Osgood’s D, and 
(5) tests of significance of distances obtained with Osgood D. 

Written in FORTRAN IV for processing on the CDC 6000
Series, the programs can handle up to 10 subscales on each of 36
concepts and up to 50 cases. A flow chart is shown in Figure 1. A
description of the specific programs follows.


SDMSM 


This program assumes the use of 8-10 (our studies used nine) of
the bipolar scales from Osgood, such as valuable-worthless,
fast-slow, and strong-weak, which have been used to rate the concepts.
Computer output includes means, standard deviations, and standard
errors of the mean. Further procedures using variance or t could be
done selectively.


SDEPA


In work with the SD, Osgood and others have identified three
basic dimensions or factors of meaning: Evaluation (Good vs.
Bad), Potency (Strong vs. Weak), and Activity (Active vs. Passive).
Williams (1966), Lawson (1970, 1971), and Towne (1971)
have averaged subscale scores to derive EPA values. These EPA
values have been useful in construction of a semantic differential
model (EPA model) following somewhat the procedure of Prothro
and Keehn (1957) and Towne (1971). The EPA model construction
procedure makes the assumption (which some purists may question)
that E, P, and A are orthogonal factors and composed of
equal units. While the rationale of the procedure may raise doubts
with some, about 30 investigations report that models built with
EPA averages have closely approximated those built with the
more complex Osgood D scores. In building an EPA model, each
concept is plotted in three dimensional space using E, P, and A
scores as Y, Z, and X dimensions respectively. In actual practice,
the EPA scores are doubled and measured in inches. One-inch
styrofoam balls represent concepts. Dowel sticks connect the balls to
one another.

Program EPA, then, combines selected subscales to yield average
factor loadings on three dimensions: Evaluation, Potency, and
Activity. The individual investigator may select those particular
subscales which he wishes to use in combining for EPA scores.

Of course, a great deal depends upon the subscales which the
investigator chooses to represent the three dimensions. Experience
indicates that the more heavily saturated with the factors (and the
more independent) the subscales are, the more closely the model will
resemble the distance (D) model derived from Osgood D values. The
first card of the data deck contains a key for the subscale numbers
which compose each dimension. The output lists each concept and
associated E, P, and A values.


SD2PTD 


Those familiar with the Osgood procedure recall that one of the
statistics developed was D (Osgood, et al. 1957, p. 918), the
distance in semantic space between two rated concepts. The Osgood
D is obtained by first taking differences on the subscales. The
SD2PTD program obtains the distance between concepts another
way, by using the generalized distance formula between two
points in space:

$$D = \sqrt{(X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (Z_1 - Z_2)^2}$$

and substituting E, P, and A for X, Y, and Z. $E_1$, $P_1$, and $A_1$
would represent scores (means) for the first concept being compared,
$E_2$, $P_2$, and $A_2$ for the second. Output is a matrix which
indicates the distance between each concept and every other. In
addition to being used in other programs, it provides a check in the
construction of EPA models.


SDGPD 


A major way of looking at SD data is to follow the Osgood technique
and to determine the D-values between each concept and every other
concept. According to Osgood, D is a measure of profile similarity.
Thus, if the concepts GOOD and BAD were each rated on nine subscales,
the profiles could be compared. D is an index of the similarity of two
profiles and is the square root of the sum of the squared differences
between coordinate subscales on the two profiles. The larger the D,
the greater the difference in ratings; the smaller the D, the greater
the similarity. From the matrix of D’s it is possible to build a
semantic differential model (Osgood and Suci, 1952; Osgood, et al.
1957; Lawson, 1970, 1971). Output of program SDGPD is a matrix of
distances between each concept and every other.
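
The two distance measures can be compared on a toy example. The sketch below is in Python rather than the programs' FORTRAN IV and is not the authors' code; the subscale key, the ratings, and the function names are all hypothetical. It computes D over all nine subscales and, for comparison, the distance between the same two concepts after the subscales are averaged into E, P, and A scores.

```python
import math


def euclidean(u, v):
    """Square root of the summed squared differences between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def epa_scores(profile, key):
    """Average the subscales assigned to each factor (E, P, A)."""
    return [sum(profile[i] for i in idx) / len(idx) for idx in key]


# Hypothetical key: positions of the subscales making up E, P, and A.
key = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
# Hypothetical mean ratings of two concepts on nine subscales.
good = [7, 6, 5, 5, 5, 5, 4, 4, 4]
bad = [1, 2, 3, 5, 5, 5, 1, 1, 1]

osgood_d = euclidean(good, bad)                                  # all nine subscales
epa_d = euclidean(epa_scores(good, key), epa_scores(bad, key))   # E, P, A only
print(osgood_d, epa_d)
```

The two values are on different scales, which is one reason the agreement between the procedures is assessed with a rank correlation rather than by comparing distances directly.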

As mentioned above, a matrix of coordinates was also developed
from program SDEPA. Then it was pointed out that it is possible
to build a model from those values. The method developed by Osgood
is somewhat more difficult. The concepts are plotted by distances
between them (the D scores) rather than by X, Y, and Z
coordinates. Anyone who has built such a model knows what a
confusing and frustrating task it is.

One of the major difficulties of building the Osgood D model
directly is confusion over which plane to put the location of the
various concepts since the plots are made in distances between the
concepts. The first three or four concepts plot rather easily. It is
after that that the difficulties begin. However, building an EPA
model first or drawing one first will significantly speed up the
construction process. The labor time for construction of the D model
is also cut down considerably, since the EPA model can be used as a
guide. The investigator may also wish to make use of the next
program which correlates the distances obtained from Program
SD2PTD with those obtained by Osgood’s D formula.


This program has been adapted slightly from the LRANK subroutine
of the IBM System/360 Scientific Subroutine Package. It
computes the Spearman rank correlation between two sets of distance
scores on the semantic differential: EPA distances from SD2PTD
(based on Program EPA values) and D values from SDGPD. The higher the
correlation, the greater the similarity in the relationship of
respective values, indicating that the two approaches are more likely
to be measuring the same factors. In several investigations with
5-30 cases in each, correlations have ranged from .87 to .99,
indicating a very high level of agreement between the two procedures.
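
A Spearman rank correlation between two sets of distances can be sketched as follows. This is a plain reimplementation for illustration, not the Scientific Subroutine Package routine or the authors' adaptation of it; it assigns average ranks to ties and then takes the product-moment correlation of the ranks. The distance values are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Product-moment correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical distances for the same five concept pairs from the two procedures.
epa_distances = [1.2, 3.4, 2.2, 5.0, 4.1]
osgood_distances = [2.0, 5.5, 4.9, 9.1, 7.3]
print(spearman(epa_distances, osgood_distances))
```

In this made-up example both procedures order the concept pairs identically, so the coefficient is at its maximum.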


whether a particular D (output from Program SDGPD) between
any two concepts is of statistical significance. The investigator may
wish to learn whether the concept UNITED STATES was rated
significantly closer to GOOD MAN or to BAD MAN. To make the
analysis all of the D's for each case in the sample would have to be
computed between UNITED STATES and GOOD MAN and between
UNITED STATES and BAD MAN. This is a great deal of
work. Lazowick (1955), Guptill (1965), and with some variation,
Williams (1966) have analyzed D scores but the analyses
have been on a somewhat limited basis. Since the distribution of
D values is not known, the investigators have used nonparametric
measures. One of those used has been the Wilcoxon paired replicates
method. The Wilcoxon has been chosen because it is distribution
free. Program SDINDD prepares the data for such an analysis.
For each case the D value is computed between each concept and
every other concept.


SDWIL

Modified from the MPAIR subroutine of the IBM System/360
Scientific Subroutine Package, Program SDWIL uses the output
from SDINDD to compute Wilcoxon paired replicate analyses.
The program assigns ranks to two sets of distances. Thus, in the
illustration above it would be determined whether the group had
rated UNITED STATES closer to GOOD MAN or to BAD MAN
and at what level of probability. SDWIL computes the direction
and significance for each concept against three pairs of distances
simultaneously.


Availability 
Copies of the program in FORTRAN IV with descriptive com- 
ments and sample data as run on the CDC 6400 are available on 
request from George H. Golden, Jr., Computer Center, State Uni- 
versity College, Fredonia, New York 14063. 


SCORING AND ANALYZING STUDENT RESPONSES
TO TEACHER-MADE TESTS USING THE RCA
SPECTRA 70/46 TIME SHARING OPERATING
SYSTEM


ROLLAND L. BROUSSARD AND PATRICK W. MALLETT
University of Southwestern Louisiana


Special features of TESCAN (Test Scoring and Analysis), a
FORTRAN IV program, developed at the University of Southwestern
Louisiana Computing Center, are listed below:

1. Ease of use by teachers who are not computer-oriented. 
(Keypunching is necessary only for one identification pa- 
rameter card. All other cards are Port-A-Punch cards.) 

2. Inclusion of statistical data familiar to most test users. 

3. Program flexibility (three scoring and analysis options avail- 
able to users). 

4. Scores corrected for guessing available under one of the 
options. 

5. Item analysis printed in a form easily usable for item banking.
Printout may be clipped out and pasted onto a 5" x 8"
card.

6. Scores printed in rank order accompanied by stanines.

7. Significance test of discrimination available with item analysis.

8. Multiple cards scored (maximum of 6 test cards for 210
items).

9. Ease of expansion of the program above 6 cards and 210
items.

10. High capacity (accommodates a maximum of 999 students
in one pass).

11. Batch mode processing possible (multiple data sets inputted
on one run).

12. No extra peripheral devices required; i.e., tapes or discs.


Output Description. 


Scoring, statistical analysis, and item analysis are the basic
features of the program. Three options are available to users. The
first option provides essentially a scoring program with the familiar
minimum statistics. Option II includes all of Option I plus item
analysis and additional statistics. Option III produces corrected
scores and, based on these corrected scores, all of the test statistics
available in Options I and II. The user can choose Options I, II,
III, or + III.

Option I includes the following: number of students tested, number
of test items, highest score, lowest score, range, mean, median,
mode, variance, standard deviation, standard error of the mean,
frequency distribution of scores, and individual student scores
expressed as percentages, accompanied by number right, number
wrong, and number blank.

Option II includes all of Option I plus the following calculated
on number right scores, one point for each correct answer: mean,
median, variance, standard deviation, standard error of the mean,
and standard error of measurement. Kuder-Richardson 21 and
item analysis including option response frequency distribution,
difficulty level, discriminating power and significance of
discrimination are also calculated in Option II.

Including all of the information in Options I and II, Option III
generates and uses corrected scores calculated on the basis of:

$$S = R - \frac{W}{O - 1}$$

where

R = number right
W = number wrong
O = number of options.
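
The correction can be sketched as below, in Python rather than the program's FORTRAN IV. The example assumes a 35-item test in which every item has five options, an inference from the figures printed in Tables 1 and 2 rather than something the article states; the function names are illustrative.

```python
def corrected_score(right, wrong, options):
    """Number right minus number wrong divided by (options - 1)."""
    return right - wrong / (options - 1)


def percentage(score, n_items):
    return 100.0 * score / n_items


# Number right and number wrong for three students (taken from Table 1).
for right, wrong in [(33, 2), (31, 4), (29, 6)]:
    s = corrected_score(right, wrong, 5)
    print(s, round(percentage(s, 35), 2))
```

Under these assumptions the corrected scores and percentages match the three rows shown in Table 2. When items differ in their number of options, as the program permits, the correction would have to be applied item by item.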


Input Description 


Students record their responses on IBM Port-A-Punch Cards. 
The answer key is also a Port-A-Punch Card. The student identi- 
fication field of the first answer key card is used to indicate the 
TABLE 1
Partial Printout of Student Scores (Option I)

R L BROUSSARD EDUC 370 TEST II-X TTH 2-315 SP’70


Student   Stanine   Percentage   Number   Number   Number
Code      Score     Score        Right    Wrong    Blank
2                   94.29        33       2        0
2                   88.57        31       4        0
2 17                82.86        29       6        0


option by the user. The only card requiring keypunching is the
parameter card in which an “L” is punched in the first column and
then followed by identifying information such as instructor’s name,
course number, or section. Users selecting Option III must fill out
an additional set of Port-A-Punch cards to indicate the number of
options available for each item. This set of cards is placed after the
key cards. This arrangement permits items with different numbers
of options to be scattered throughout the test. The input order,
arranged by groups, follows:

Group I     Identification card

Group II    Answer key cards

Group III   Blank card

Group IV    Option distribution cards (necessary for Option III
            only)

Group V     Student Port-A-Punch response cards

Group VI    Blank card


Output Format

Excerpts from a printout are presented below. Table 1 is a sample
printout of student scores under Option I. Table 2 contains student
scores corrected for chance (Option III). Table 3 consists of a
portion of an item analysis printout.


TABLE 2
Partial Printout of Corrected Student Scores (Option III)

R L BROUSSARD EDUC 370 TEST II-X TTH 2-315 SP'70


Student ID   Stanine Score   Percentage Score   Number Right
4            9               92.86              32.50
1            8               85.71              30.00
11           8               78.57              27.50


TABLE 3
Partial Printout of Item Analysis

R L BROUSSARD EDUC 370 TEST II-X TTH 2-315 SP’70

Item Analysis

Item             Response          Item         Item
Number   Group   1 2 3 4 5         Difficulty   Discrimination   Significant
         U       239-000
1                                  0.773        0.311            No
         L       3 8 000
         U       0 is 000
2                                  1.000        0.000            No
         L       оп ооо

* Indicates correct response.


Summary


The authors believe that the special features of the program pre- 
sented at the beginning of the article make this program especially 
useful for the scoring and analysis of teacher-made tests at any 
level of instruction. Because the value of an automated test scoring 
and analysis program has been presented well in many other refer- 
ences, no such documentation is offered here. The range of analysis 
available, from simple scoring to corrected scores with item anal- 
ysis, makes the program attractive to both inexperienced and ex- 
perienced test administrators. 


Additional information may be obtained from the authors. 


A COMPUTER PROGRAM FOR KEYING OPTIONS OF
MULTIPLE-CHOICE TESTS TO INCREASE INTERNAL
CONSISTENCY*


RICHARD R. REILLY AND BARBARA J. DYNARSKI
Educational Testing Service 


A high degree of internal consistency is a desirable property for 
most multiple-choice tests to have. The purpose of the program 
described is to increase internal consistency by assigning empir- 
ically derived weights to the options of each item. A procedure 
similar to that described was used by Hendrickson (1971) in a 
successful attempt to increase the internal consistency of subtests 
of the Scholastic Aptitude Test. 

Guttman (1941) first suggested that the maximum product moment
correlation between a set of categories and a continuous criterion
will be achieved when each category is weighted with the
mean criterion score of all persons choosing it. Stanley and Wang
(1970) and, in a somewhat different context, Beaton (1968) gave
independent proofs of this method.

Guttman’s method forms the basis for the program described
which keys each option of a multiple-choice question by assigning
the mean standard score on the remaining items for all persons
choosing that option. An option is defined as any set of mutually
exclusive response categories. Thus, the responses “omit” and “not
reached” could also be keyed.

The steps in the keying procedure are as follows: 


Step 1.

All items are scored initially using some a priori weighting
scheme (e.g., right = 1, wrong = $-1/c$, where $c$ is one less than
the number of choices, and omit = 0). Compute all item means,
variances, and covariances. In addition, compute total test score
mean and variance.

At this point the internal consistency coefficient (coefficient
$\alpha$) may be computed as follows

$$\alpha = \frac{m}{m - 1}\left(1 - \frac{\sum_{i=1}^{m} S_i^2}{S_t^2}\right)$$

where $S_i^2$ and $S_t^2$ are the item and total test score variances
and $m$ is the number of items.


* Funds used to develop the program were provided by the Graduate
Record Examinations Board.


Step 2.

Weight each response of item $i$ by assigning the mean standard
score on the $m - 1$ remaining items of all individuals choosing
that response.


Step 3.

Using the weights derived in Step 2, rescore all tests.


Step 4.

Repeat Steps 1, 2, and 3 with the new score distribution and
continue iterations until the desired number is reached or until
the increment in $\alpha$ is less than some predetermined level. The
weights derived at the final iteration are used to rescore the tests a
final time so that the last internal consistency coefficient can be
computed.
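
The coefficient computed in Step 1 and tracked across iterations can be sketched as follows, using the standard form of coefficient alpha, $\alpha = \frac{m}{m-1}\bigl(1 - \sum S_i^2 / S_t^2\bigr)$. The sketch is in Python rather than the program's FORTRAN IV; the data layout (`scores[p][i]` for person p and item i) and the item scores themselves are hypothetical.

```python
def variance(xs):
    """Variance with divisor N; alpha is the same if N - 1 is used throughout."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def coefficient_alpha(scores):
    """scores[p][i] is the score of person p on item i."""
    m = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(m)]
    total_var = variance([sum(row) for row in scores])
    return (m / (m - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical item scores for four persons on three items.
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(coefficient_alpha(scores))
```

In an iterative keying run, this function would be called after each rescoring, and iteration would stop once the gain in alpha fell below the chosen threshold.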

The procedure outlined here keys on standard scores to avoid the
differences in mean item weights which might result from initial
differences in mean item scores (remembering that the keying for
each item is done on the scores on the $m - 1$ remaining items).
Standardization also serves to prevent the total test score from
becoming increasingly large. Finally, the necessity of recomputing
the entire standard total score distribution for each of the
$(m - 1)$-item “tests” is avoided by using the relationship:

$$W_{ik} = \frac{\bar{X}_{ik} - w_{ik} - \bar{X} + \bar{x}_i}
{\sqrt{S_t^2 + S_i^2 - 2\sum_{j=1}^{m} S_{ij}}}$$

where

$W_{ik}$ is the weight assigned to option $k$ of item $i$ and is the
mean standard score of all persons choosing option $k$ of item $i$ on
the score distribution of the $m - 1$ remaining items;

$\bar{X}_{ik}$ is the mean $m$ item score for all persons choosing
option $k$ of item $i$;

$w_{ik}$ is the original weight assigned to option $k$ of item $i$;

$\bar{X}$ is the $m$ item score mean; and

$\bar{x}_i$ is the mean item score for item $i$.

Here $S_{ij}$ denotes the covariance of items $i$ and $j$, with
$S_{ii} = S_i^2$, so that the denominator is the standard deviation of
the score on the $m - 1$ remaining items.
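
The shortcut can be checked against the direct computation. The sketch below, on hypothetical data, computes an option weight two ways: directly, by standardizing the score on the remaining items and averaging it over the persons who chose the option, and indirectly, from statistics of the full m-item score (the mean total of the choosers minus the option's original weight, centered on the mean remainder score and divided by the remainder's standard deviation). Function and variable names are illustrative assumptions.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Covariance with divisor N; cov(xs, xs) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def weight_direct(scores, responses, i, k):
    """Mean standard score, on the remaining items, of persons choosing option k of item i."""
    rest = [sum(row) - row[i] for row in scores]
    chosen = [r for r, resp in zip(rest, responses) if resp[i] == k]
    return (mean(chosen) - mean(rest)) / math.sqrt(cov(rest, rest))


def weight_shortcut(scores, responses, weights, i, k):
    """The same weight obtained from statistics of the full m-item score."""
    m = len(scores[0])
    totals = [sum(row) for row in scores]
    items = [[row[j] for row in scores] for j in range(m)]
    x_ik = mean([t for t, resp in zip(totals, responses) if resp[i] == k])
    cov_sum = sum(cov(items[i], items[j]) for j in range(m))  # includes j == i
    rest_var = cov(totals, totals) + cov(items[i], items[i]) - 2 * cov_sum
    return (x_ik - weights[k] - mean(totals) + mean(items[i])) / math.sqrt(rest_var)


# Hypothetical data: six persons, three items, options coded 0, 1, 2,
# with the same a priori weights for every item.
weights = {0: 1.0, 1: -0.5, 2: 0.0}
responses = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 2, 1), (2, 1, 1)]
scores = [[weights[option] for option in resp] for resp in responses]

print(weight_direct(scores, responses, 0, 1))
print(weight_shortcut(scores, responses, weights, 0, 1))
```

The two printed values agree, which is the point of the relationship: only one pass over the total scores is needed per iteration.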

To avoid too much shrinkage when keys are applied to a new
group, the keying sample should be relatively large. Because of
sampling error, final $\alpha$ estimates should be obtained on a
holdout sample.

Program Description 


The program is written in FORTRAN IV for the 360/65 system
and uses as input the item responses of N persons for a set of m
items. Initial scoring is done using formula score weights (i.e., right
answer = 1, wrong = $-1/c$, omit = 0). Maximum capacity is 10
categories and 100 items.

The program prints out the $\alpha$ coefficient and the complete set
of option weights obtained after each iteration. A listing of the
program is available from the authors on request.


A COMPUTER PROGRAM TO GENERATE RELIABILITY
INDICES FOR COMPOSITE TESTS INCLUDING A
CROSS-VALIDATION TECHNIQUE¹


WILLIAM D. SCHAFER 
University of Maryland 


The computer program discussed here produces four different 
indices of reliability for a composite test (one which is divided 
into several subtests, but which yields a total score). The output 
includes coefficient alpha and the coefficient from the Jackson and 
Ferguson (1941) battery reliability procedure, which uses the 
coefficients alpha of the subtests in the formula for the reliability 
of a sum in terms of the reliabilities of and the intercorrelations 
among its components. A cross-validation technique, described be- 
low, is used to obtain two new types of split-half coefficients, which 
also appear as output. 

The examinees are randomly split into two groups, group E 
(experimental) and group V (validation), in the cross-validation 
technique. For each subtest, using the group E data, comparisons 
are made among all possible splits of the items into two halves, in 
order to determine which split yields the highest split-half reli- 
ability coefficient. This “best split” is then employed to calculate a 
corrected split-half reliability coefficient, using, for the purposes of 
cross-validation, just the data from group V. For each subtest, 
output includes the subtest coefficient alpha, the “best split,” the 
corrected “best split-half” coefficient using the data from group 
V, and the correlations between the subtest and each of the other 
subtests. 
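The cross-validation step for a single subtest can be sketched as follows. This is an illustrative outline rather than the author's FORTRAN V code: the function names are hypothetical, an even number of items is assumed, and the "corrected" coefficient is assumed to mean the Spearman-Brown correction, which the text does not state explicitly.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)   # undefined if a half-score has zero variance

def split_half_r(data, half_a, half_b):
    # Correlate the two half-test scores across examinees
    a = [sum(person[j] for j in half_a) for person in data]
    b = [sum(person[j] for j in half_b) for person in data]
    return pearson(a, b)

def all_splits(n_items):
    # Item 0 is fixed in the first half so that each split is listed once
    for rest in combinations(range(1, n_items), n_items // 2 - 1):
        half_a = (0,) + rest
        half_b = tuple(j for j in range(n_items) if j not in half_a)
        yield half_a, half_b

def best_split(group_e, n_items):
    # Group E only: the split with the highest split-half correlation
    return max(all_splits(n_items), key=lambda s: split_half_r(group_e, *s))

def spearman_brown(r):
    return 2 * r / (1 + r)

def cross_validated_reliability(group_e, group_v, n_items):
    half_a, half_b = best_split(group_e, n_items)   # chosen on group E
    return spearman_brown(split_half_r(group_v, half_a, half_b))  # scored on V

# Hypothetical 0/1 item scores for a four-item subtest
GROUP_E = [[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1],
           [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 0]]
print(best_split(GROUP_E, 4))
```

Because the split is selected on group E but evaluated on group V, the resulting coefficient is not inflated by the search over splits. The number of splits grows quickly with subtest length, which is consistent with the program's limit on items per subtest.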

For the total test, the “best splits” of each of the subtests are 
combined to create similar half-forms, and this partitioning of the 
items is used with the data from group V to calculate a corrected 
split-half coefficient. This partitioning produces a form of the 
symmetrical-split coefficient, as suggested by Cattell and Butcher 
(1968). The split, itself, also appears as output. Additionally, the 
formula for the reliability of a sum is applied using the “best- 
split” coefficients of the subtests and their intercorrelations. The 
size of the total group and the size of the validation group appear 
as output. 

¹ Computer time for this project was supplied in full through the facili- 
ties of the Computer Science Center of the University of Maryland. 


Input 
Input to the program includes two title cards, a parameter card, 
subtest definition cards, a variable format card, and examinee item 
scores. If the scoring is dichotomous, the user has the option of 
providing an answer key, and the program will then provide item 
scoring. 


Limitations 
In order to conserve computer time, the maximum number of 
items has been set at 160. These must be divided among at most 
15 subtests with a maximum of 18 items on any one subtest. 
However, these limits can easily be changed. 


Computer Language and Subroutine 


This program is written in FORTRAN V for the UNIVAC 1108 
computer. A uniform random number generator is the only subrou- 
tine called. 


Availability 


A listing of the program and a program description may be 
obtained by writing to Dr. William D. Schafer, Department of 
Measurement and Statistics, College of Education, University of 
Maryland, College Park, Maryland 20742. 


A COMPUTER PROGRAM FOR RANDOMLY SELECTING 
TEST ITEMS FROM AN ITEM POPULATION 


ROBERT S. BARCIKOWSKI 
AND 
JERRY L. PATTERSON 
Ohio University 


In teaching a particular course over a number of years, an edu- 
cator often collects a large number of test items. The program de- 
scribed in this paper obviates the need for the usually tedious process 
of selecting different items each time a new test is desired. Specif- 
ically, the program randomly selects and prints a desired number 
of items from an item population which may consist of as many 
as 1000 items. For each item the program randomly selects and 
prints a desired number of distractors from a maximum of nine 
possible distractors. Although a somewhat similar procedure has 
been reported by M. I. Charles E. Woodson (1968), the described 
program offers several distinct advantages for the test writer: 

1. Multiple choice, true-false, and completion items can be used. 

2. Any item stem can utilize a maximum of 11 data cards. 

3. Any distractor can utilize a maximum of 11 data cards. 

4. For a given item, the educator can randomly select a desired 
number of distractors from those available. 

5. The program has a simple format. Only one control card is 
necessary and the items and distractors are arranged on cards 
using a scheme which is easy to follow. 

6. The items can be read from cards or from a magnetic tape. 

7. To facilitate test preparation, the program output provides 
(properly labeled) the test item number, original item number, 
item answer (randomly positioned), the number of distractors 
available, and the number of distractors desired. 


Input 


Control Card (I FORMAT) 


Columns 1-4: Number of items available (maximum of 1000). 

5-8: Number of items desired. 

9-12: Number of distractors available (can be changed for 
any question, maximum of nine). 

13-22: An arbitrary number which initializes the pseudo- 
random number generator (an odd nine digit 
number). 

23: Blank. 

24: Any single digit integer, if the tape option is used. 
Otherwise, leave column (24) blank. 

25-26: Tape number, if the tape option is used. Otherwise, 
leave columns (25-26) blank. 


Multiple Choice Items 


Columns 1-4: Item number (I FORMAT). 

5-78: Test item (A FORMAT). 
If the item stem requires more than one card, repeat 
the above for each card of the item stem, then, on 
the final card of the item stem in column 

79: Number of distractors available, if different from 
original information supplied in columns 9-12 of 
the control card. Otherwise, leave column (79) 
blank (I FORMAT). 

80: Number of distractors desired (I FORMAT). 


Distractors (The first distractor is always the correct answer.) 


Columns 1-4: Blank. 

5-8: Distractor number (I FORMAT). 

9-79: The distractor (A FORMAT). 
If a distractor requires more than one card, repeat 
the above for each card of the distractor, then, on 
the final card of each distractor in column 

80: Number 1 (I FORMAT). 


True-False Items 


Columns 1-4: Item number (I FORMAT). 

5-76: Test item (A FORMAT). 
If the item stem requires more than one card, 
repeat the above for each card of the item stem, 
then, on the final card of the item stem in column 

77: Letter X, if the answer is false. Otherwise, leave 
column (77) blank (A FORMAT). 

78: Letter X, if the answer is true. Otherwise, leave 
column (78) blank (A FORMAT). 

79: Blank. 

80: On the final card of the question, type the number 1 
(I FORMAT). 


Completion Items (... the number of cards for the item answer must not 
exceed eleven cards.) 

Question 

Columns 1-4: Item question number (I FORMAT). 

5-78: Item question (A FORMAT). 

79-80: Blank. 
If the item question requires more than one card, 
repeat the above for each card of the item question. 

Answer 

Columns 1-4: Item question number (I FORMAT). 

5-78: Item answer (A FORMAT). 

79: Blank. 
If the item answer requires more than one card, 
repeat the above for each card of the item answer, 
then, on the final card of the item answer in column 

80: Number 9 (I FORMAT). 

No distractor cards are necessary for true-false or completion 
items. The test writer should note that although the first distractor 
in the item population is always the correct answer to multiple- 
choice items, on the test generated by the program the correct answer 
is in a random position among the distractors for each item. 

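The selection logic described above can be sketched in a few lines. This is not the authors' FORTRAN IV program; the dictionary layout, the function name `build_test`, and the decision to count the "desired" distractors as wrong options sampled in addition to the correct answer are all assumptions made for illustration.

```python
import random

def build_test(pool, n_items, n_distractors, seed):
    """Randomly pick items, sample distractors, and shuffle the answer position.

    pool: list of dicts with keys 'stem', 'answer', 'distractors'
          (a hypothetical layout, not the card format of the program).
    """
    rng = random.Random(seed)          # seed plays the role of columns 13-22
    picked = rng.sample(range(len(pool)), n_items)
    test = []
    for test_number, idx in enumerate(picked, start=1):
        item = pool[idx]
        k = min(n_distractors, len(item["distractors"]))
        options = rng.sample(item["distractors"], k) + [item["answer"]]
        rng.shuffle(options)           # correct answer lands in a random position
        test.append({
            "test_number": test_number,
            "original_number": idx + 1,
            "stem": item["stem"],
            "options": options,
            "answer_position": options.index(item["answer"]) + 1,
            "distractors_available": len(item["distractors"]),
            "distractors_desired": k,
        })
    return test

# Hypothetical item pool of ten items with four wrong options each
POOL = [{"stem": f"Question {i}", "answer": f"right {i}",
         "distractors": [f"wrong {i}.{j}" for j in range(4)]} for i in range(1, 11)]

for row in build_test(POOL, n_items=5, n_distractors=3, seed=123456789):
    print(row["test_number"], row["original_number"], row["answer_position"])
```

Fixing the seed makes a generated test reproducible, which mirrors the role of the initializing number on the control card.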
Capabilities and Limitations 

The program is written in FORTRAN IV. Compile and load 
time on Ohio University’s IBM model 44 G-level compiler was 
approximately twelve seconds. Execute time for a twenty question 
test selected from 100 available questions was approximately 
twenty-seven seconds. 


Availability 

Copies of a source listing of this program with example output 
can be obtained by sending two dollars ($2.00), for Xerox, mail- 
ing, and handling, to Dr. Robert S. Barcikowski, Ohio University, 
Department of Educational Research, Statistics, and Evaluation, 
McCracken Hall, Athens, Ohio 45701. 


AN ITEM ANALYSIS PROGRAM WHICH PROVIDES 
FEEDBACK TO INDIVIDUAL STUDENTS 


ALBERT C. OOSTERHOF AND A. THEL KOCHER 
The University of Kansas 


This FORTRAN program scores responses on a multiple choice 
test when these responses have been punched into data cards using 
IBM-1230 code. The program then subjects the test to an item 
analysis, and provides optional printed and/or punched output 
indicating scores earned by individual students and optional feed- 
back to each student relative to his performance on the test. The 
optional feedback consists of a separate printed page for each 
student indicating, by brief descriptions, the specific content areas 
which were missed. The feedback is especially appropriate for 
large class situations in which a high degree of interaction with 
each individual student is difficult, and would also be appropriate 
in settings in which a mastery model or criterion grading procedure 
is used. 


Input 
Separate input cards are used to indicate the title of the test, to 
select the desired options of the program, and to provide the format 
statement through which data cards are read. The data cards contain 
student responses to the individual test items, these cards being 
preceded by a card containing the correct responses. 


Output 
Output from this program includes: 
1. Mean and standard deviation of raw scores. 
2. Reliability coefficients and standard errors of measurement 
using split-half and Kuder-Richardson (formula 20) procedures. 

3. Rank-order listing of raw scores, and corresponding Z-score 
and percentile ranks. 

4. Histogram indicating frequency at which each raw score oc- 
curred. 

5. Description of each test item using the following indices: 

a. Difficulty and discrimination. 

b. Percent of students choosing each option. 

c. Percent of students choosing each option when classified into 
the three categories of upper 25%, middle 50%, and lower 25% 
relative to their total test scores. 

d. Proportion of students choosing each option when divided into 
decile groups according to their total test scores. 

e. Item characteristic curve. 

6. Listing by student identification number of raw-scores, Z-scores, 
and percentile ranks (optional). 

7. Printed page for each student which includes the respective 
student’s raw-score and Z-score on the test, along with a listing 
of specific content areas requiring additional study (optional). 

8. Punched output with each card indicating a student’s identi- 
fication number, the name of the test, and the student’s raw- 
score on the test (optional). 
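Several of the listed indices are simple to compute once the responses have been scored. The sketch below is illustrative only and is not taken from the program: it assumes items are already scored 0/1, takes difficulty to be the proportion answering correctly, uses an upper-minus-lower 25% discrimination index (the source does not say which discrimination index the program reports), and applies the standard Kuder-Richardson formula 20 with a divide-by-N variance.

```python
def item_analysis(scores):
    """scores[p][i] is 1 if person p answered item i correctly, else 0."""
    n, k = len(scores), len(scores[0])
    totals = [sum(row) for row in scores]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n        # assumed: divide by N

    # Difficulty: proportion of examinees answering each item correctly
    p = [sum(row[i] for row in scores) / n for i in range(k)]

    # KR-20 = k/(k-1) * (1 - sum(p*q) / total-score variance)
    kr20 = (k / (k - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / var)

    # Discrimination: proportion correct in top 25% minus bottom 25%
    order = sorted(range(n), key=lambda q: totals[q])
    g = max(1, n // 4)
    low, high = order[:g], order[-g:]
    disc = [(sum(scores[q][i] for q in high) - sum(scores[q][i] for q in low)) / g
            for i in range(k)]
    return {"difficulty": p, "discrimination": disc, "kr20": kr20}

# Hypothetical scored responses: four examinees, three items
SCORES = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
print(item_analysis(SCORES))
```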


Capabilities and Limitations 

This program, which is written in FORTRAN IV, will accom- 
modate a test with a maximum of 999 items; however, it places no 
limit on the number of students included in the analysis. The pro- 
gram is currently structured to take advantage of the “dynamic 
core” feature of the Honeywell 635 System; however, minor modi- 
fication of the program would make the program compatible with 
systems without a corresponding feature (i.e., a fixed core alloca- 
tion of 20K words would allow for a maximum of 50 items, or 30K 
words would allow for a maximum of 150 items). The program re- 
quires one tape or disc unit for use as a scratch file. 


Availability 
A listing of this program which includes illustrative input and 
output can be obtained by writing either author, Bureau of Edu- 
cational Research, 205 Bailey Hall, The University of Kansas, 
Lawrence, Kansas 66044. 


INTRACLASS CORRELATION AS A RELIABILITY 
CHECK ON NOMINAL DATA 


JIM MINTZ 
University of Pennsylvania 
CARL WIEDEMANN 
John Jay College of Criminal Justice 


Many research problems require judges to assign objects or 
stimuli (often people) to discrete categories. The results of these 
categorical assignments may be used as a dependent or an inde- 
pendent variable for further research. Concerned about the re- 
liability of these judgments, the investigator may wish to know, 
before proceeding with further study, the extent to which judges 
agree on their assignments. 


Purpose 


The procedure and program here described are designed to as- 
sess the reliability of J judges who are assigning N stimuli to one 
of K categories. The data must be complete, and each stimulus must 
be assigned to only one category by each judge. The categories 
are then successively dichotomized according to the logic of the 
research and the wishes of the researcher. 


Example 


In a typical case in clinical research, judges might be asked to 
assign patients to one of four categories: “organic,” “psychotic,” 
“neurotic,” or “normal.” Table 1 presents fictitious data illustrating 
such a problem. The researcher then supplies orthogonal contrast 
codes, dichotomizing categories as is appropriate to the problem. 
Procedures for writing orthogonal contrast codes may be found in 
several texts (e.g., Edwards, 1964). While the program does not 
require that the codes used be orthogonal, such contrasts have the 
advantage of fully utilizing the information available. There will 
be k − 1 contrasts for k categories, a circumstance which ex- 
hausts the degrees of freedom for categories. For the fictitious 
example above, three contrasts of interest might be as follows: 
(a) normal vs. all others (presence or absence of pathology of any 
kind); (b) organic vs. other pathology (differential diagnosis of 
organicity); (c) psychotic vs. neurotic. Contrast codes for these 
three comparisons are presented in Table 2. 


TABLE 1 
Fictitious Categorizations of Subjects 

Subject    Judge 1      Judge 2      Judge 3 
   1       psychotic    psychotic    psychotic 
   2       psychotic    psychotic    neurotic 
   3       psychotic    organic      organic 
   4       psychotic    neurotic     psychotic 
   5       organic      organic      organic 
   6       organic      neurotic     neurotic 
   7       neurotic     neurotic     normal 
   8       neurotic     neurotic     neurotic 
   9       neurotic     normal       normal 
  10       normal       normal       neurotic 


The program computes intraclass correlations for each com- 
parison, successively assigning the contrast code values as “scores” 
for each judge’s categorization of each case. The intraclass cor- 
relation indexes the degree to which subjects are being rated the 
same across judges, in terms of the contrasts (Haggard, 1958). 
It ranges from zero to +1, and its significance is judged by the 
associated F-test for between subjects variance. The intraclass cor- 
relations for the example, and the analyses of variance, appear in 
Table 3. 


TABLE 2 
Matrix of Contrast Code Weights 

                              Contrast 
             Normal vs.    Organic vs.        Psychotic vs. 
Category     all others    other pathology    neurotic 
Psychotic       -1              1                  1 
Neurotic        -1              1                 -1 
Organic         -1             -2                  0 
Normal           3              0                  0 

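The computation can be reproduced from Tables 1 and 2. The sketch below is not the authors' Fortran IV program; it is a minimal one-way analysis of variance in which the intraclass correlation is taken as (MS between − MS within) / (MS between + (J − 1) MS within) for J judges. That formula is an assumption here, chosen because it yields the F ratios and R values printed in Table 3 (for example, F = 2.41 and R = .32 for the first contrast).

```python
# Table 2: contrast code weights for each category
CODES = {
    "normal vs. all others":       {"psychotic": -1, "neurotic": -1, "organic": -1, "normal": 3},
    "organic vs. other pathology": {"psychotic": 1, "neurotic": 1, "organic": -2, "normal": 0},
    "psychotic vs. neurotic":      {"psychotic": 1, "neurotic": -1, "organic": 0, "normal": 0},
}

# Table 1: categorizations of ten subjects by three judges
RATINGS = [
    ("psychotic", "psychotic", "psychotic"),
    ("psychotic", "psychotic", "neurotic"),
    ("psychotic", "organic", "organic"),
    ("psychotic", "neurotic", "psychotic"),
    ("organic", "organic", "organic"),
    ("organic", "neurotic", "neurotic"),
    ("neurotic", "neurotic", "normal"),
    ("neurotic", "neurotic", "neurotic"),
    ("neurotic", "normal", "normal"),
    ("normal", "normal", "neurotic"),
]

def intraclass(ratings, code):
    """One-way ANOVA on contrast scores; returns (F, intraclass R)."""
    scores = [[code[c] for c in row] for row in ratings]
    n, j = len(scores), len(scores[0])           # subjects, judges
    grand = sum(map(sum, scores)) / (n * j)
    ss_between = j * sum((sum(r) / j - grand) ** 2 for r in scores)
    ss_within = sum((x - sum(r) / j) ** 2 for r in scores for x in r)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (j - 1))
    f_ratio = ms_between / ms_within
    r = (ms_between - ms_within) / (ms_between + (j - 1) * ms_within)
    return f_ratio, r

for name, code in CODES.items():
    f_ratio, r = intraclass(RATINGS, code)
    print(f"{name}: F = {f_ratio:.2f}, R = {r:.2f}")
```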

An infinitude of contrasts could have been generated for the 
problem above, instead of those in Table 2. The contrast codes 
chosen are dictated by the interests of the researcher. If, in the 
example, the investigator thought the basic agreement should be on 
severity of pathology, he could have made the first contrast “organic 
and psychotic” vs. “neurotic and normal.” The remaining two 
contrasts could then compare the categories within each set. Usually, 
the nature of the research will naturally structure the contrasts 
used. 


Summary 

The procedure described yields estimates of the reliability with 
which judges employ nominal categories, when such categorical 
assignment has been sequentially dichotomized into a series of 
orthogonal contrasts. The program is written in Fortran IV for 
the CDC-6600. The user must specify k − 1 sets of weights for 
contrasts among the k categories. The program then assigns these 
weights as scores to each subject, and computes a one-way analysis 
of variance for each of the k − 1 sets of scores. 

Output from the program includes the scores assigned to each 
subject for each of the contrasts, the analysis of variance for each 
contrast, the intraclass correlation, and the weights for each con- 
trast. Current limits are 20 categories, 100 subjects, and 10 judges, 
but these can be enlarged by the user. Program and sample data 
are available from the authors on request. Comment cards in the 
program indicate required input. 


TABLE 3 
Analyses of Variance for Fictitious Data 

Source          SS      df     MS      F        R 

Contrast 1: Normal vs. all others 
Between Ss     34.67     9    3.85    2.41*    .32 
Within Ss      32.00    20    1.60 

Contrast 2: Organic vs. other pathology 
Between Ss     27.37     9    3.04    4.34**   .53 
Within Ss      14.00    20     .70 

Contrast 3: Psychotic vs. neurotic 
Between Ss     10.03     9    1.12    2.57*    .34 
Within Ss       8.67    20     .43 

* p < .05. 
** p < .01. 


REFERENCES 
Edwards, Allen. Experimental design in psychological research. 


A GENERALIZED NONPARAMETRIC ANALYSIS 
OF VARIANCE PROGRAM¹ 


JAMES J. ROBERGE 
Temple University 


McSweeney, 1967), and (d) calculate post hoc multiple compari- 
sons (Dunn, 1964; Nemenyi, 1963; Rosenthal and Ferguson, 1965). 


Control Cards 


1 = Nonparametric test (1 = Kruskal-Wallis; 
2 = Friedman; 3 = Cochran) 

2-3 = Number of samples or experimental condi- 
tions (k) 

4 = Trend analysis (1 = yes; 0 = no) 

5 = Ferguson's test (1 = yes; 0 = no) 

6 = Marascuilo and McSweeney test (1 = yes; 
0 = no) 

7 = Multiple comparisons (1 = yes; 0 = no) 

8 = All possible pair-wise comparisons (1 = yes; 
0 = no) 

9-11 = If column 7 is 1, and column 8 is 0, the 
number of comparisons is punched in these 
columns; otherwise they are left blank. 

12 = Nemenyi’s test (1 = yes; 0 = no) 

13 = Dunn’s test (1 = yes; 0 = no) 

14 = Rosenthal and Ferguson test (1 = yes; 
0 = no) 

15-20 = If column 1 is 2, and column 14 is 1, then the 
F-ratio required for significance at the .05 
level (df = k − 1, n − k + 1) is punched in 
these columns (Note: the decimal must be 
punched); otherwise, they are left blank. 

21-26 = If column 1 is 2, and column 14 is 1, then the 
F-ratio required for significance at the .01 
level (df = k − 1, n − k + 1) is punched in 
these columns (Note: the decimal must be 
punched); otherwise, they are left blank. 

¹ The author gratefully acknowledges the support for this research which 
was provided by a Faculty Research grant funded by Temple University. 


Contrasts matrix format card 


This F-type variable format card describes each row of the 
arbitrary contrasts matrix. This format may be punched in any 
of the columns on the card. If column 7 on the problem card is 
0, or column 8 is 1, then this card is omitted. 


Arbitrary contrasts matrix 


This matrix is entered one row at a time. Each row must begin 
on a new card and must have k weights indicating the contrast 
to be made. These cards must be punched in accordance with the 
F-type contrasts matrix format card (see above). If column 7 
on the problem card is 0, or column 8 is 1, then these cards are omit- 
ted. 


Coefficient cards 


If columns 4 and 6 on the problem card are 1, then these cards 
contain the linear, quadratic, and cubic (if k > 3) coefficients of 
the orthogonal polynomials; otherwise, they are omitted. Each 
set of coefficients must begin on a new card and must be punched 
according to 20I4 format. 


Data format card 


This F-type variable format card indicates the location of 
the raw scores (or ranks) on the data cards. This format may be 
punched in any of the columns on the card. 


Sample card(s) 


A card (or cards) indicating the size(s) of the sample(s). For 
the Kruskal-Wallis test, the number of subjects in each sample is 
punched on the card(s) using 26I3 format. If column 5 on the 
problem card is 1, the sample sizes are punched in the same order 
as the hypothetical ranking of the independent samples. Further- 
more, if column 6 on the problem card is 1, equal sample sizes are 
required. 

For the Friedman or Cochran test, the number of subjects in 
the sample (or matched samples) is punched on the card using 
I3 format. 


Data deck 


These cards containing the data for each sample (or experimental 
condition) must be punched in accordance with the format spec- 
ified on the F-type data format card (see above). For the Kruskal- 
Wallis test, the data are punched by sample with the data for each 
sample beginning on a new card. In addition, if column 5 on the 
problem card is 1, the samples must be arranged sequentially 
according to the ordered hypothesis. 

For the Friedman or Cochran test, the data are punched by 
subject (or group of matched subjects) with the data for each 
subject beginning on a new card. Moreover, if column 5 on the 
problem card is 1, the data for each subject must be punched in 
the same order as the hypothetical ranking of the correlated 
samples. 


Last card 


If the user wishes to terminate the program, then the card im- 
mediately following the data deck must have the word FINISH 
punched in columns 1 to 6. However, if the user wishes to analyze 
another set of data, then this card is a blank one and the job 
deck is arranged sequentially (as described above) beginning with 
the problem card. 


Capabilities and Limitations 


The program is written in FORTRAN IV for processing by 
computers in the IBM 360 (or the CDC 6000) series. It can handle 
a maximum of 30 samples (or experimental conditions) and 200 
subjects per sample (or experimental condition). Jobs may be run 
sequentially as described above. 

The output varies according to the statistics desired and is 
similar to that described in previous papers (Roberge, 1970, 1971a, 
1971b, 1972). 


Availability 
Copies of this paper and a source listing which includes input 
and output data for sample problems can be obtained by writing 
to Dr. James J. Roberge, Temple University, Department of Edu- 
cational Psychology, Philadelphia, Pennsylvania 19122. 


A COMPUTER PROGRAM FOR SELECTING OPTIMUM 
SAMPLE SIZE AND NUMBER OF LEVELS IN A ONE 
WAY RANDOM EFFECTS ANALYSIS OF VARIANCE 


ROBERT S. BARCIKOWSKI 
Ohio University 


In a one way random effects analysis of variance, researchers 
are faced with the problem of having to draw inferences about an 
... set of distinct treatments or factor levels. In this case the 
researcher is not interested in the values of the individual treatment 
... were randomly selected. 

For example, suppose that a population of school teachers is 
available to teach reading to first grade pupils using a certain 
method. It is decided that the method will be adopted for use pro- 
vided its success is not heavily dependent on the personalities of 
... randomly assigned to a random sample of classes of first grade 
pupils. The dependent variable (a standardized reading test) is 
selected and a one way analysis of variance, random effects 
model, is to be used to test the hypothesis that there are no signif- 
icant differences among the teachers. 

Researchers faced with the preceding problem are often ignorant 
of how many teachers to select and the number of students to as- 
sign ... 

Background 

In the random effects model, the null hypothesis is generally 

$$H_0\colon \sigma_A^2 \le \theta_0 \sigma_e^2$$

with the alternative being 

$$H_a\colon \sigma_A^2 > \theta_0 \sigma_e^2.$$

Here $\sigma_A^2$ is the variance of the effects from factor A; $\sigma_e^2$ is the 
variance of the sample elements, and $\theta_0 \ge 0$ is a preassigned 
constant. 

Then the power of the F test (Scheffe, 1959; Guenther, 1964) is a 
function of $\theta = \sigma_A^2/\sigma_e^2$. Here $\theta$ is Cohen’s (1969) “effect size” for 
the random effects model. That is, it is an index of the degree of 
departure from the null hypothesis which we want to detect. So 
that, 

$$\text{Power} = \Pr\{F(L - 1, N - L) > F(\alpha; L - 1, N - L)\,(1 + n\theta_0)/(1 + n\theta)\}$$

where 

F = the F statistic, 

L = the number of levels, 

$\alpha$ = the level of significance, 

N = the total number of elements, 

n = N/L, the number of elements in a cell of the design. 

That is, power for any one way random effects analysis of vari- 
ance is found using the central F distribution and is calculated by 
finding the probability of drawing an F value from the central F 
distribution having L − 1 and N − L degrees of freedom which is 
greater than or equal to the value 
$F(\alpha; L - 1, N - L)\,(1 + n\theta_0)/(1 + n\theta)$. 


Procedure 


The program allows one to select the optimum number of levels 
(L) to be used with a fixed total number of people (N) such that 
the power of the statistical F test is maximized. Here, N/L = n 
is the number of people in each cell of the design. Equal n's in 
each cell must be assumed, since some difficulties arise in the 
random effects analysis unless there are equal numbers of observa- 
tions (Hays, 1965, p. 419). 

In each case the possible number of levels may range from two 
to N/2 since at L = 1, $\sigma_A^2$ cannot be estimated, and at L = N 
neither $\sigma_A^2$ nor $\sigma_e^2$ can be estimated. For example, if $\theta = .5$, $\theta_0 = 0$, 
$\alpha = .05$ and N = 20, the program would generate the following val- 
ues for the levels from two through ten (N/2): 


Levels (L)    Power 

   ...         ... 
   10        .26448 


Maximum power (.45105) is reached when there are four levels 
with five (N/L) elements under each level. 
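The power calculation for this example can be sketched with the central F distribution alone. The code below is not the author's FORTRAN IV program: the F tail probability is obtained from a standard continued-fraction evaluation of the incomplete beta function, the critical value is found by bisection, and only levels that divide N evenly are evaluated, since the source does not say how fractional cell sizes are handled. All function names are hypothetical.

```python
from math import exp, lgamma, log

def _betacf(a, b, x):
    # Continued fraction for the incomplete beta function (modified Lentz method)
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, 301):
        even = m * (b - m) * x / ((a + 2 * m - 1) * (a + 2 * m))
        odd = -(a + m) * (a + b + m) * x / ((a + 2 * m) * (a + 2 * m + 1))
        for num in (even, odd):
            d = 1.0 + num * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + num / c
            c = c if abs(c) > tiny else tiny
            h *= d * c
        if abs(d * c - 1.0) < 1e-14:
            break
    return h

def betai(a, b, x):
    # Regularized incomplete beta function I_x(a, b)
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def f_upper_tail(f, d1, d2):
    # Pr(F(d1, d2) > f) for the central F distribution
    return betai(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))

def f_critical(alpha, d1, d2):
    # Bisection for the value with upper-tail probability alpha
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f_upper_tail(mid, d1, d2) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def power(levels, n_total, theta, theta0, alpha):
    n = n_total / levels                          # elements per cell
    d1, d2 = levels - 1, n_total - levels
    cut = f_critical(alpha, d1, d2) * (1 + n * theta0) / (1 + n * theta)
    return f_upper_tail(cut, d1, d2)

# Example from the text: theta = .5, theta0 = 0, alpha = .05, N = 20
POWERS = {L: power(L, 20, 0.5, 0.0, 0.05) for L in (2, 4, 5, 10)}
for L, p in POWERS.items():
    print(L, round(p, 5))
```

Under these assumptions the values for four and ten levels should agree with the .45105 and .26448 quoted in the text to about three decimal places.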


Input 


Problem Card 1 
Columns 1-4 = Number of data sets to be read (i.e., the number 
of times Problem Cards 2-4 will be repeated). 


Problem Card 2 
Columns 1-3 = Alpha, level of significance for the F test. 
4-6 = Number of $\theta$'s to be read. 
7-9 = Number of N's to be read. 
10-13 = $\theta_0$, hypothesis value. 


Problem Card 3 
Columns 1-3 = First value of $\theta$. 
4-6 = Second value of $\theta$, etc., up to 20 values. 


Problem Card 4 
Columns 1-4 = First value of N. 
5-8 = Second value of N, etc., up to 20 values. 


Output 
The output includes the alpha level; N, the total number of 
elements in the design; $\theta$, the index of degree of departure from 
the null hypothesis; $\theta_0$, the hypothesis value; and a table of possible 
random levels and their corresponding power values for each of 
the preceding parameters. If a power value is greater than .9995, 
a new N, $\theta$, or data set is considered. 

Availability 
The program is written in FORTRAN IV for the IBM 360 Model 
44 G-level compiler at Ohio University. Copies of a source listing 
of this program can be obtained by writing to Dr. Robert S. 
Barcikowski, Ohio University, Department of Educational Re- 
search, Statistics, and Evaluation, McCracken Hall, Athens, Ohio 
45701. 


ANALYSIS OF COVARIANCE WITH MULTIPLE 
COVARIATES (COVARMLT) 


JOHN D. WILLIAMS AND ALFRED C. LINDEM 
The University of North Dakota 


Analysis of covariance programs are typically available, but 
many of these programs severely limit the number of covariates, 
usually to one or two covariates. This limitation is wholly un- 
necessary. The analysis of covariance can be conceptualized as 
being completed through the use of two linear models, and a mul- 
tiple linear regression solution follows in a straightforward man- 
ner. 

The first model, or full model, can be given by 

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_c X_c + b_{c+1} X_{c+1} + \cdots + b_{c+k-1} X_{c+k-1} + e_1 \qquad (1)$$

where $X_1$ through $X_c$ are the c covariate variables and $X_{c+1}$ through 
$X_{c+k-1}$ are binary coded group membership variables (1 if a member 
of a group, 0 otherwise) for the first k − 1 of the k groups. The 
second model, or restricted model, can be given by 

$$Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_c X_c + e_2 \qquad (2)$$

where $X_1$ through $X_c$ are the c covariate variables. The test of signif- 
icance can be constructed from the resulting multiple correlation 
coefficients (R), and is given by 

$$F = \frac{(R_{FM}^2 - R_{RM}^2)/(k - 1)}{(1 - R_{FM}^2)/(N - c - k)} \qquad (3)$$

where $R_{FM}^2$ is the square of the multiple correlation coefficient for the 
full model, $R_{RM}^2$ is the square of the multiple correlation coefficient 
for the restricted model and N is the total number of subjects. 
Equations 1, 2, 3 conceptually explain the process in COVARMLT; 
however, the program does not presuppose an understanding of 
linear models. The user can think of the program as allowing 
several covariates in the solution of an analysis of covariance; the 
program will do the actual computations. 
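The significance test built from the two models reduces to a one-line calculation once both squared multiple correlations are known. The sketch below is illustrative and not part of COVARMLT: the function name and the example values are hypothetical, and the degrees of freedom (k − 1 and N − c − k) are the usual ones for comparing a full model with c + k − 1 predictors against a restricted model with c predictors.

```python
def ancova_f(r2_full, r2_restricted, n, c, k):
    """F ratio for the group effect, adjusted for c covariates.

    r2_full:       squared multiple correlation, covariates plus group vectors
    r2_restricted: squared multiple correlation, covariates only
    n: total number of subjects; c: covariates; k: groups
    """
    df_num = k - 1                 # group membership vectors dropped
    df_den = n - c - k             # residual df of the full model
    f_ratio = ((r2_full - r2_restricted) / df_num) / ((1.0 - r2_full) / df_den)
    return f_ratio, df_num, df_den

# Hypothetical values: 30 subjects, 2 covariates, 3 groups
f_ratio, df1, df2 = ancova_f(0.60, 0.50, n=30, c=2, k=3)
print(f"F({df1}, {df2}) = {f_ratio:.3f}")
```

With these illustrative numbers the numerator is .10/2 = .05 and the denominator is .40/25 = .016, giving an F of about 3.125 on 2 and 25 degrees of freedom.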


Input 
Data cards contain for each observation the criteria, covariates, 
and group membership variables in any format or order. The pa- 
rameter cards specify problem identification, total number of vari- 
ables, number of criteria, and optional printout of data. The 
criteria card specifies which variables are to be criteria. The 
covariate card specifies which variables are covariates. The group 
membership card specifies k − 1 of the k binary coded groups. 

Limitations 
The maximum dimensions are as follows: 
99,999 observations, 50 variables including the criteria vari- 
ables, covariate variables, and group membership variables. This 
program thus will allow up to 49 covariates (when there are two 
groups and a single criterion). The dimensions can be increased or 
decreased to fit local needs and limitations. 


Computer and Program Language 


This program is written in FORTRAN IV level F for the IBM 
360-40; the program will run in a 54k partition. 


Output 

The analysis of variance and analysis of covariance summary 
tables are reported for each criterion. Also included are the means 
and adjusted means for each group. The within group regression co- 
efficients are also included. The following information regarding the 
regression solutions is also available for the full and restricted 
models: means, standard deviations, regression weights, standard 
error for regression weights, computed t value, and beta weights for 
each predictor (covariate and group membership) variable. The 
multiple correlation coefficient (R), R², 1 − R² and the analysis of 
variance for the regression are also included for the full and re- 
stricted models. 

A printout of the program and sample output will be supplied on 
request. 


AN EVALUATION AND COMPARISON OF 
THREE COMPUTER PROGRAMS FOR 
APPROXIMATION OF THE DISTRIBUTION 
FUNCTION OF THE F STATISTIC¹ 


WILLIAM K. BROOKSHIRE 
North Texas State University 


Jaspen (1965) reported a FORTRAN computer program for ap- 
proximating the probability P associated with a given F with de- 
grees of freedom for the numerator and denominator. This program 
was later evaluated by Golden, Weiss, and Dawis (1968) who 
found “with certain important exceptions, the approximation is 
quite accurate.” The National Bureau of Standards recently pub- 
lished a subroutine, PROB, (1970) to perform the same task. 

Subroutine PROF, which is the investigator’s modification of the 
IBM subroutine BDTR, may also be used to calculate the prob- 
ability of an F. The purpose of this study was to evaluate these 
three techniques with respect to accuracy and speed of computa- 
tion. 


The method of evaluation used in this study was patterned after 
that used by Golden, et al. (1968) in their evaluation of Jaspen’s 
(1965) FORTRAN subroutine. Basic to this method is the calculation 
of per cent error; that is, the probability was approximated for an 
“exact F value”² having a particular probability level and degrees of 
freedom. If $P$ is the exact probability value and $\hat{P}$ is the approximate 
probability level, then the “per cent error”³ PE is defined by the equation 

$$PE = \frac{\hat{P} - P}{P} \times 100.$$

In this study per cent error was calculated for P equal to .05, .01, 
and .005. 

¹ This research was supported by North Texas State University faculty re- 
search grant, 35590. 
² The exact values of F were taken to be those in the tables of M. Merring- 
ton and C. M. Thompson in Biometrika, Vol. 33, 1943-1946. 
³ The possible consequences in decision making associated with the sign of 
the per cent error were discussed by Golden, et al. (1968). 


Jaspen’s Subroutine 


Replication of the Golden, et al. (1968) evaluation of Jaspen’s
subroutine led to confirmation in general of their results. The per
cent errors reported by Golden, et al. were not duplicated exactly
by this study, but the differences were small, and the sign of the
errors was in agreement.

Table 1 reports the per cent error for the F statistic when
computed by Jaspen’s subroutine for various degrees of freedom (n,
numerator; d, denominator) at the .01 level. The per cent error for
the t statistic is given in the first column of Table 1 (F_{1,d} = t_d²).
The per cent error for the χ² statistic is given in the last row
(assuming df in the denominator of 1000 as equivalent to ∞,
F_{n,∞} = χ²_n/n).


Subroutine PROB 


PROB is a subroutine in the National Bureau of Standards
program OMNITAB II by Sally T. Peary, Ruth N. Varner, and
David Hogben (1970). This subroutine uses the mathematical
formulas presented by Abramowitz and Stegun (1964, p. 946).

TABLE 1

The Per cent Error for the F Distribution Function at Various Degrees of
Freedom for the Numerator n and the Denominator d when P = .010,
Calculated by Jaspen’s Subroutine

                                          n
    d          1          2          4         10         20         60       1000

    1    23.8238    29.1126    30.8911    31.7093    31.9490    32.1029    32.1507
    2   -26.3465   -17.0893   -12.1330    -9.0525    -8.0071    -7.3061    -6.9775
    4    12.5802    18.9084    23.3496    27.0753    28.6290    29.8004    30.5520
   10    -1.6970     1.2222     2.7480     4.3421     5.3612     6.3920     8.1213
   20    -2.6259    -0.0031     0.7633     1.1838     1.5735     2.1925     5.8466
   60    -2.5171     0.2822     0.8991     0.7259     0.5579     0.5072    13.7021
 1000    -1.0246     2.6835     4.5363     7.1745    11.0680    26.4166         xx

Only minor changes were necessary in subroutine PROB, as it is
found in the Source Listing of OMNITAB II (1970). Depending
on the computer make and model, it may be necessary to change
some of the function names. Two functions had to be added to
statement PRB 30. With the additions it reads:

      DOUBLE PRECISION DSIN, DCOS, DEXP, DLOG, DSIGN, DABS

When very large F’s with very large degrees of freedom were used
in the calculations, it was necessary to “double precision”
statement PRB 390:

      X = DBLE(ONEP) - DBLE(V2/(V2 + V1*F))

In most statistical applications this last change would not be
necessary.

Table 2 reports the per cent error for the F statistic when
computed by subroutine PROB for various degrees of freedom at the
.01 level. Again, the per cent error for t and χ² statistics may be
found by consulting the appropriate column or row.

Upon careful analysis of the more complete tables, the following
observations were made:

1. The per cent error becomes greater as P becomes smaller.

2. When n < 120 and d < 120, there is apparently no systematic
   relationship between n, d, and the sign of the error.

3. When d > 3 and n = 1000, the error is positive and increases
   as d becomes larger.

4. When d = 1000, the error is positive and increases as n
   becomes larger.

5. When n < 120 and d < 120, the accuracy is such that at the
   probability levels computed, no incorrect decisions would be
   made if the probability level is printed out to four decimal
   places. Even at the .005 level, the per cent error never exceeded
   .2 per cent.

TABLE 2

The Per cent Error for the F Distribution Function at Various Degrees of Freedom
for the Numerator n and the Denominator d When P = .010, Calculated by PROB

                                          n
    d          1          2          4         10         20         60       1000

    1    -0.0002     0.0       -0.0001     0.0004     0.0003     0.0002    -0.0266
    2    -0.0001     0.0005     0.0005     0.0005     0.0005    -0.0001    -0.0025
    4    -0.0025     0.0005     0.0005    -0.0007    -0.0048     0.0029     0.1626
   10     0.0083     0.0017     0.0029    -0.0037    -0.0001    -0.0037     1.0794
   20    -0.0007     0.0023     0.0011     0.0005     0.0077     0.0029     3.1292
   60     0.0017     0.0035     0.0095     0.0113     0.0172    -0.0072    13.1976
 1000     1.4525     2.1356     3.8456     6.9689    11.4291    27.7554         xx

Subroutine PROF

Subroutine PROF is a modification of the IBM subroutine
BDTR (1968). BDTR computes the probability of a random
variable X following the beta distribution with particular degrees of
freedom A and B.

PROF uses the fact that

\[ X = \frac{d}{d + nF} \]

where F is the calculated F to be evaluated,
d = degrees of freedom in the denominator, which equals 2A,
and
n = degrees of freedom in the numerator, which equals 2B.

PROF is the result of the following changes in BDTR:

1. The values received by BDTR were changed from X, A, and
   B to F, n, and d.

2. The calculation of the ordinate at X, the test for valid input
   data, and the error messages were deleted.

3. The following statements were added following the double
   precision statement.

      P = 1.0
      IF (F*DJ*DI .LE. 0.0) RETURN
      A = DI/2.0
      B = DJ/2.0
      X = DI/(F*DJ + DI)

   where DI = d
   and DJ = n.
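
The relation PROF relies on can be sketched in Python. This is only an illustration under stated assumptions, not a transcription of BDTR or PROF: the function names are hypothetical, and the incomplete beta function is evaluated here with a standard continued fraction (modified Lentz method), which may differ from the method BDTR actually uses.

```python
from math import exp, lgamma, log, log1p


def _beta_continued_fraction(a, b, x, max_iter=500, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    c = 1.0
    d = 1.0 - (a + b) * x / (a + 1.0)
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Coefficients of the even and odd steps of the fraction.
        even = m * (b - m) * x / ((a - 1.0 + m2) * (a + m2))
        odd = -(a + m) * (a + b + m) * x / ((a + m2) * (a + 1.0 + m2))
        for coeff in (even, odd):
            d = 1.0 + coeff * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + coeff / c
            if abs(c) < tiny:
                c = tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def incomplete_beta(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = exp(lgamma(a + b) - lgamma(a) - lgamma(b)
                + a * log(x) + b * log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_upper_tail(f, n, d):
    """Probability of exceeding F with n and d degrees of freedom."""
    if f * n * d <= 0.0:        # mirrors the guard added to BDTR: P = 1.0
        return 1.0
    x = d / (f * n + d)         # X = DI/(F*DJ + DI)
    return incomplete_beta(x, d / 2.0, n / 2.0)   # A = d/2, B = n/2
```

For n = d = 2 the result can be checked by hand, since the tail probability reduces to 1/(1 + F); F = 19 then corresponds to the .05 level.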


Table 3 reports the per cent error for the F statistic when
computed by subroutine PROF for various degrees of freedom at the
.01 level. As before, the per cent error for the t and χ² statistics may
be found by consulting the appropriate column or row.

Upon careful analysis of the more complete tables, the same
observations were made as in the PROB case. The per cent error
tended to be slightly less overall with PROF than with PROB.


TABLE 3

The Per cent Error for the F Distribution Function at Various Degrees of Freedom
for the Numerator n and the Denominator d when P = .01, Calculated by PROF

                                          n
    d          1          2          4         10         20         60       1000

    1    -0.0003     0.0       -0.0001     0.0004     0.0003     0.0002    -0.0261
    2    -0.0005     0.0        0.0004     0.0002     0.0001    -0.0005    -0.0028
    4    -0.0026    -0.0001     0.0003    -0.0013    -0.0053     0.0028     0.1626
   10     0.0080     0.0014     0.0023    -0.0040    -0.0005    -0.0042     1.0792
   20    -0.0014     0.0020    -0.0009    -0.0020     0.0066     0.0018     3.1275
   60     0.0007     0.0031    -0.0072    -0.0124     0.0022    -0.0222    13.1803
 1000     1.4307     2.1352     3.2769     6.1717    10.4832    26.5622         xx


Time

The timing of the three subprograms was done in a loop similar
to that below:

      DO 10 I = 1,34
      CALL NTCLOK (ITIME)
      PRINT 8, ITIME
      READ (5, 2) (FR(J), J = 1, 19)
      DO 10 J = 1,19
      CALL PROF (FR(J), DA(J), DB(I), PF)
      P(I,J) = PF
      PE(I,J) = ((PF - TP)/TP)*100.0
   10 CONTINUE
      CALL NTCLOK (ITIME)

NTCLOK returns the number of 1/100 seconds since midnight.

As can be seen from the program segment, the elapsed time
between executions of NTCLOK includes more than just the
execution of the probability subprogram. Thus, the reported average
times are not the actual average times of execution of the
subprograms, but they do give some idea of the relative speed of the three
subprograms. ITIME was printed out 35 times. The reported
average times are calculated by subtracting the first ITIME
from the last and dividing by 646 (19 × 34), the number of F’s
evaluated.
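
The same measurement can be sketched in Python: take a clock reading before and after a batch of evaluations, divide by the number of F values, and record the per cent error of each result. The names below are hypothetical, and `f2_upper_tail` is only a stand-in routine (exact for n = 2), not one of the three subprograms evaluated in this study.

```python
import time


def percent_error(approx, exact):
    """Signed per cent error of an approximate probability."""
    return (approx - exact) / exact * 100.0


def f2_upper_tail(f, n, d):
    """Stand-in routine (hypothetical): exact upper tail for n = 2 only."""
    if n != 2:
        raise ValueError("this stand-in only handles n = 2")
    return (d / (d + 2.0 * f)) ** (d / 2.0)


def time_and_score(prob_fn, cases):
    """cases: list of (F, n, d, exact_p).

    Returns the per cent errors and the mean elapsed seconds per call.
    As in the FORTRAN loop, the elapsed time includes loop overhead.
    """
    errors = []
    start = time.perf_counter()
    for f, n, d, exact_p in cases:
        errors.append(percent_error(prob_fn(f, n, d), exact_p))
    elapsed = time.perf_counter() - start
    return errors, elapsed / len(cases)


cases = [(19.0, 2, 2, 0.05), (99.0, 2, 2, 0.01), (199.0, 2, 2, 0.005)]
errors, seconds_per_call = time_and_score(f2_upper_tail, cases)
```

Timing a whole batch rather than a single call follows the approach in the text and gives relative rather than absolute speeds.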


TABLE 4
Average Time of Execution in Hundredths of a Second

                P = .05    P = .01    P = .005
   Jaspen          0.6        0.6        0.6
   PROB            1.5        1.6        1.6
   PROF           10.1        9.9        9.8


Discussion

A comparison of the three subprograms showed both PROB
and PROF to be superior in accuracy to Jaspen’s subprogram.
Both PROB and PROF were accurate enough for the t and F
statistics that the probability may be reported to four decimal places
without causing an incorrect decision to be made at the normally
used levels of significance and degrees of freedom. None of the
three subprograms proved as satisfactory for the χ² statistic.

When length of program and time of execution were considered, 
PROB proved to be superior to PROF. PROF may be more readily 
available by modifying BDTR, which would be in the library of 
most computer centers. PROB would be available in most libraries 
which are depositories for government documents. Jaspen’s
program would be best if, in a particular application, the core usage of
the subprogram were the most important consideration. If Jaspen’s
program were used, it would be advisable to include in the printout
a note that a table should be consulted if the reported probability
level is close to the critical value.

In summary, three computer programs for computing the
probability of a given F with degrees of freedom were evaluated with
respect to accuracy and speed of computation. The superior
program proved to be subprogram PROB published by the National
Bureau of Standards. Subprogram PROF was accurate but slower
than the other two programs. Jaspen’s subprogram was the least
accurate, but may be preferred when the amount of core used by
the program is the most important consideration.


Although randomization tests of significance have been known
for many years (Fisher, 1935), they have been used very
infrequently in psychological research. Their advantages in being exact,
distribution-free tests and in preserving without alteration the
variability of the original data have been outweighed by two
disadvantages. Algorithms have not been generally available and,
even if they had been, the computation would have been
prohibitively tedious and lengthy. The present paper describes an
algorithm for the comparison of means for two random groups of
scores. The algorithm was based on a brief verbal statement
(Ray, 1966). A computer program for the algorithm was
written, and information was obtained on time requirements.
Randomization test results were compared with results for an analysis
of variance.
In general a randomization test involves the comparison of an
observed value of a statistic with the totality, T, of such values
for all the logically possible arrangements of scores. If the number
of values as extreme as, or more extreme than, the observed value
is greater than αT, where α is the nominal level of significance, the
null hypothesis is accepted. If the number of extreme values is less
than αT, the null hypothesis is rejected.

In testing the difference between means of two random groups,
it is not actually necessary to generate the totality of possible
results. Systematic examination of those arrangements most likely to
yield extreme values can proceed until the number counted exceeds
αT, at which time the null hypothesis can be accepted. If the
number of extreme values is in fact smaller than αT, the examining
procedure may have to continue until a very large number of results
have been evaluated. In effect, a nonsignificant result may require
relatively little computation whereas a significant result may
require a great deal.
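
The decision rule just described can be sketched in Python as a plain enumeration with early stopping. This is an illustration only, not the algorithm of this paper: it visits arrangements in the order produced by `itertools.combinations` rather than examining the most extreme arrangements first, it assumes a directional test in which Group 2 is expected to have the larger mean, and the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def randomization_test(group1, group2, alpha):
    """Directional randomization test of mean(group2) > mean(group1)."""
    scores = list(group1) + list(group2)
    n1, n2 = len(group1), len(group2)
    total = sum(scores)

    def mean_difference(sum2):
        # Difference between means when Group 2 has the sum sum2.
        return sum2 / n2 - (total - sum2) / n1

    observed = mean_difference(sum(group2))
    criterion = alpha * comb(n1 + n2, n2)      # alpha * T
    count = 0
    for members in combinations(range(n1 + n2), n2):
        sum2 = sum(scores[i] for i in members)
        if mean_difference(sum2) >= observed:  # as extreme or more extreme
            count += 1
            if count > criterion:              # stop early: accept H0
                return "accept", count
    return "reject", count
```

With two groups of five, T = 252 and αT = 6.3 at α = .025, so enumeration stops as soon as a seventh extreme arrangement is found, whereas a rejection requires all 252 arrangements to be examined.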

Neither is it necessary to complete each rearrangement in order
to evaluate it. The algorithm starts with the computation of the
observed difference between weighted sums, which is more convenient
than the observed difference between means. If the two groups of
scores are then rearranged so that one group contains the highest
scores and the other contains the lowest, the maximum (or minimum)
difference is achieved. Other outcomes can be obtained by a series
of exchanges in which r values of one group are exchanged for an
equal number in the other group. Given groups of sizes n₁ and n₂,
the number of possible exchanges of r values is
\( {}_{n_1}C_r \cdot {}_{n_2}C_r \). Any exchange
of r values reduces the maximum difference or increases the minimum
difference by an amount that can be computed from the values in
the exchange. The amount of the reduction or the increase will be
called the weighted exchange difference, d_we. Each weighted exchange
difference is counted or not, after it is compared to a criterion value,
C, the difference between the maximum (or minimum) difference
and the observed difference. Given a positive observed difference,
if d_we is equal to or less than the criterion value, the arrangement is
counted; if d_we is greater than C, no count is registered. A similar
rule applies when the observed difference is negative. The decision
to reject or accept the null hypothesis is based on the count of
exchange differences.
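
A small numerical sketch in Python may make the quantities concrete. It assumes the weighted difference is taken as n₁ΣX₂ − n₂ΣX₁ with a positive observed difference; the function names and the data are hypothetical.

```python
def weighted_difference(group1, group2):
    """D = n1 * sum(group2) - n2 * sum(group1)."""
    return len(group1) * sum(group2) - len(group2) * sum(group1)


def extreme_arrangement(group1, group2):
    """Low group of size n1 and high group of size n2 (observed D > 0)."""
    ordered = sorted(list(group1) + list(group2))
    return ordered[:len(group1)], ordered[len(group1):]


def weighted_exchange_difference(n_total, from_high, from_low):
    """d_we for exchanging the values from_high and from_low."""
    return abs(n_total * (sum(from_high) - sum(from_low)))


group1, group2 = [3, 8, 5], [9, 4, 7, 6]
d_obs = weighted_difference(group1, group2)                # 14
low, high = extreme_arrangement(group1, group2)
d_max = weighted_difference(low, high)                     # 42
criterion = d_max - d_obs                                  # 28
d_we = weighted_exchange_difference(7, [6], [5])           # 7
after_swap = weighted_difference([3, 4, 6], [5, 7, 8, 9])  # 35
```

Exchanging the 6 of the high group for the 5 of the low group lowers the maximum difference from 42 to 35, which is exactly d_we = 7, so the exchange can be evaluated without forming the rearranged groups.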

Shortcuts are possible. They involve the use of certain
strategically located exchanges, called maximum and minimum
exchanges. The maximum exchange of r values is the exchange of the
r highest scores in the high group for the r lowest scores in the low
group. The minimum exchange involves the r lowest scores in the
high group and the r highest scores in the low group.


The Algorithm 


The algorithm is presented in eight steps. Definitions and
rationale are interspersed where necessary.

1. Compute f_c, the criterion frequency or the number of
outcomes so favorable in appearance as to be occasions for
rejection.

For a directional test, \( f_c = \alpha \, {}_{N}C_{n_1} \), where N = n₁ + n₂.

For a nondirectional test, \( f_c = (\alpha/2) \, {}_{N}C_{n_1} \).

2. Compute D_o, the observed difference between weighted sums.

\[ D_o = n_1 \sum X_2 - n_2 \sum X_1 \]

where \( \sum X_k \) is the sum of the kth group.

3. If D_o > 0, place the n₂ largest scores in Group 2 and the n₁
lowest scores in Group 1. Order the scores in each group. Index
scores in Group 1 with 1 to n₁, running from high to low. Index
scores in Group 2 with 1 to n₂, running from low to high.

Compute D_M, the maximum difference.

\[ D_M = n_1 \sum X_2 - n_2 \sum X_1. \]

If D_o < 0, place the n₁ largest scores in Group 1 and the n₂
lowest scores in Group 2. Order the scores in each group. Index
scores in Group 1 with 1 to n₁, running from low to high. Index
scores in Group 2 with 1 to n₂, running from high to low.

Compute |D_m|, the numerical value of the minimum difference.

\[ |D_m| = \left| n_1 \sum X_2 - n_2 \sum X_1 \right|. \]

If D_o = 0, accept H₀.

4. Compute C, the criterion difference.

\[ C = D_M - D_o \quad \text{or} \quad |D_m - D_o|. \]

5. If r ≤ n₁ and n₂, continue.

6. Check the maximum exchange for combinations of r values
by computing the numerical value of the weighted exchange
difference,

\[ d_{we} = |N (S_{r2} - S_{r1})| \]

where S_r2 and S_r1 are sums of r values with the highest index
numbers in Groups 2 and 1 respectively.

If d_we ≤ C, compute \( {}_{n_1}C_r \cdot {}_{n_2}C_r = c \).

Compare f with f_c.

If f > f_c, accept H₀.

If f ≤ f_c, increase value of r by one unit and repeat Step 5.

If d_we > C, continue.
7. Check the minimum exchange of r values.

\[ d_{we} = |N (S_{r2} - S_{r1})| \]

where S_r2 and S_r1 are sums of r values with the lowest index
numbers in Groups 2 and 1 respectively.

If d_we > C, reject H₀.

If d_we ≤ C, continue.

8. Obtain a weighted exchange difference, d_we, by comparing
each possible combination of r values from Group 1 with each
possible combination of the same size from Group 2.

A combination of r values is constituted by r choices. The first
choice for each group has an index variable with an initial value of
one. Each succeeding choice has an initial index value of one
more than that of the preceding choice. The index of each choice
for each group has a terminating value equal to the group size
minus the number of choices to follow. After termination on any
given index, except the first, the index of the preceding choice is
increased by one unit and the process continues with the given index
taking on a new initial value of one more than the preceding and
other succeeding indices taking on increasing values. All
combinations of Group 2 are processed for each combination of Group 1.

If d_we > C for any exchange, repeat Step 8 for the next exchange.

If d_we ≤ C, compute f = f + 1.

If f > f_c, accept H₀.

If f ≤ f_c, repeat Step 8 for the next exchange.
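
The exhaustive part of Step 8 can be sketched in Python, with `itertools.combinations` standing in for the index scheme described above (it yields combinations in the same increasing-index order). The sketch omits the shortcuts of Steps 6 and 7 and the early exit on f > f_c, handles only the case of a positive observed difference, and starts the count at one for the extreme arrangement itself; that starting value is an assumption, since the text does not state the initial value of f. Function names are hypothetical.

```python
from itertools import combinations


def count_exchanges(low, high, r, criterion):
    """Count exchanges of r values whose d_we does not exceed C."""
    n_total = len(low) + len(high)
    count = 0
    for from_low in combinations(low, r):
        sum_low = sum(from_low)
        # All combinations of the high group for each one of the low group.
        for from_high in combinations(high, r):
            d_we = n_total * (sum(from_high) - sum_low)
            if d_we <= criterion:
                count += 1
    return count


def count_extreme(group1, group2):
    """Arrangements whose weighted difference is at least the observed one."""
    n1, n2 = len(group1), len(group2)
    d_obs = n1 * sum(group2) - n2 * sum(group1)
    ordered = sorted(list(group1) + list(group2))
    low, high = ordered[:n1], ordered[n1:]
    d_max = n1 * sum(high) - n2 * sum(low)
    criterion = d_max - d_obs
    f = 1  # assumption: the extreme arrangement (no exchange) is counted
    for r in range(1, min(n1, n2) + 1):
        f += count_exchanges(low, high, r, criterion)
    return f


f = count_extreme([3, 8, 5], [9, 4, 7, 6])  # 11 of the 35 arrangements
```

For these data C = 28 and N = 7, so an exchange counts when the sum taken from the high group exceeds the sum taken from the low group by no more than 4; nine exchanges of one value and one exchange of two values qualify.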

The requirements for a systematic coverage of all possible
exchanges of r = 1 and r = 2 are represented in Table 1.

After termination on the first index for Group 1, at which time
the exchanges of r values will have been exhausted, increase r by
one unit. Compare r with n₁ and n₂.

If r > n₁ or n₂, reject H₀.
If r ≤ n₁ and n₂, repeat Step 7.

The null hypothesis is rejected when a minimum exchange check
that does not count is encountered. The null hypothesis is accepted
when the number of favorable outcomes exceeds the criterion
frequency.

TABLE 1
Exchanges for r = 1 and 2


Tests of the Program

A computer program for the algorithm was written in the
language FORTRAN IV and tested with compilers G and H. The
program with extensive annotation is available in unpublished
form. It was tested on 18 sets of artificial data. Comparisons of
decisions were made with those for an analysis of variance.

Three sets of data were constructed for each of six combinations
of group sizes. The first set consisted of two groups of two-digit
numbers produced by a random generator. Artificial effects were
introduced in the second set by adding a constant to the numbers
of the second group. The third set was produced by adding a larger
constant to the original numbers of the second group. Table 2 shows
the various combinations of group sizes and the additive constants
as well as the results obtained from the tests.

TABLE 2
Test Information and Results

                                        Time in Minutes
                                     IBM 360/50, Compiler
  n₁    n₂      α      a   Decision       G          H

   5     5    .025    00      A         0.18       0.16
                      30      A         0.19       0.16
                      40      A         0.19       0.18
  10     5            00      A         0.19       0.16
                      30      A         0.19       0.16
                      40      R         0.19       0.16
  15     5            00      A         0.19       0.18
                              A         0.20       0.18
                      20      R         0.26       0.21
  10    10            00      A         0.22       0.20
                      10      A         0.26       0.21
                      20      R         0.55       0.36
  15    10            00      A         0.33       0.25
                                        1.92       1.25
                      30      R         6.34       3.75
  15    15    .005    00      A         4.90       2.80
                      30      A         4.75       2.71
                      60      R       120.08*    119.73

                                         IBM 360/85
  15    10    .025    30      R         0.37
  15    15    .005    00      A         0.26
                      60      R         9.76

* Not completed and no decision.
n₁, n₂ = sizes of groups.
α = level of significance.
a = additive constant.
A = accept H₀.
R = reject H₀.

All tests are represented as directional tests with either .025 or
.005 as the level of significance. However, the tests are interpretable
as nondirectional where groups are equal in size and the
distribution is symmetrical, in which case α is double the probability listed.
Where groups are unequal, in which case the distribution may be
asymmetrical, a nondirectional test requires a special application of
the program. It is convenient to use α as in a directional test to
examine both ends of the distribution for extreme positive and
negative values. If f > f_c = αT on the first side examined, accept H₀
without testing on the other side. If f ≤ f_c on the first side and f >
f_c on the second side, accept H₀. If f ≤ f_c on the first side and again
on the second side, then combine as Σf. If Σf > f_c, accept H₀. If
Σf ≤ f_c, reject H₀.

Randomization tests were done on the IBM 360/50 computer,
initially with the G compiler and later with H. Conventional
analyses of variance were also performed. Table 2 summarizes the
results in terms of decisions and times. One randomization test (n₁ =
n₂ = 15; a = 60) was not completed with the G compiler because
of the excessive time required. It was completed with the H
compiler and also on the IBM 360/85, along with the test for n₁ = 15,
n₂ = 10, a = 30, and the test for n₁ = n₂ = 15, a = 0. The results
for the three tests performed on the IBM 360/85 are given at the
bottom of Table 2. Time required was reduced with the H compiler
and drastically reduced on the 360/85. The typical time required
for the analysis of variance was .15 minutes.

The decisions made on all 18 randomization tests agreed with
those of the analysis of variance.


Applications 


The test can be employed where the means of two random
groups are to be compared, as in a completely randomized
experimental design with two treatments. It can also be used where the
means of three random groups are to be compared, as in a
completely randomized design with three equally spaced levels of
treatment. With three levels, linear and quadratic comparisons
can be evaluated. The linear comparison is the difference between
the sums for the first and third levels of treatment. The quadratic
comparison is the difference between the weighted sum for the
second level and the combined sum for the first and third levels.


POLAC: A COMPUTER PROGRAM TO COMPUTE
AUTO-, CROSS-LAGGED, AND POOLED
WITHIN-CELL CORRELATIONS

ROGER BAKEMAN 
The University of Texas at Austin 


1. Function: Program POLAC computes auto- and cross-lag
   intercorrelations (Ralston and Wilf, 1960; Holtzman,
   1963; Veldman, 1967; Campbell, 1963). The program
   independently computes lagged correlation matrices for any
   number of treatment conditions (cells) and then computes
   a pooled within cell lagged intercorrelation matrix based on
   all cells (Winer, 1962, p. 604). Missing observations, as
   defined by the user, may be excluded from the
   computations. The user specifies the lag(s) desired. Punched output
   may be obtained for use as input to multiple regression
   programs: in the absence of instructions to the contrary,
   punched output is compatible with SPSS (Statistical
   Package for the Social Sciences; Nie, Bent, and Hull, 1970).

   The pooled correlation is conceptually (although not
   computationally) an “average” correlation: it is as though we
   noted the correlation between variable X and variable Y
   in each of several treatment conditions or cells, and then
   computed the average correlation between X and Y for all
   cells.

   The pooled within-class correlation differs
   computationally from the Pearson correlation only in that deviation
   scores are computed by subtracting from each observed
   score, not the grand mean, but the mean for the given cell.

   \[ r_w = \frac{\sum_{j=1}^{k} \sum_{i=1}^{N_j} (X_{ij} - \bar{X}_j)(Y_{ij} - \bar{Y}_j)}
   {\left\{ \left[ \sum_{j=1}^{k} \sum_{i=1}^{N_j} (X_{ij} - \bar{X}_j)^2 \right]
   \left[ \sum_{j=1}^{k} \sum_{i=1}^{N_j} (Y_{ij} - \bar{Y}_j)^2 \right] \right\}^{1/2}} \]

   Where degrees of freedom for the Pearson correlation are
   represented by N − 2 (where N represents the number of
   observations), the degrees of freedom for the within-class
   correlation are computed by summing the degrees of
   freedom for each cell:

   \[ df = \sum_{j=1}^{k} (N_j - 2) \]
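
The computation can be sketched in Python as follows. This is not POLAC's code: the function names are hypothetical, the cell means are taken over the paired (lagged) observations, which is an assumption, and missing-data handling is left out. The lag convention follows the description under LAG in the input section: a positive lag pairs each row observation with a column observation made earlier.

```python
from math import sqrt


def lagged_pairs(row, col, lag):
    """Pair row[t] with col[t - lag]; a positive lag uses earlier col values."""
    return [(row[t], col[t - lag])
            for t in range(len(row)) if 0 <= t - lag < len(col)]


def pooled_within_cell_r(cells, lag=0):
    """cells: list of (row_series, col_series), one pair per cell."""
    sxy = sxx = syy = 0.0
    for row, col in cells:
        pairs = lagged_pairs(row, col, lag)
        # Deviations are taken from the cell means, not the grand means.
        mean_x = sum(x for x, _ in pairs) / len(pairs)
        mean_y = sum(y for _, y in pairs) / len(pairs)
        for x, y in pairs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
            syy += (y - mean_y) ** 2
    return sxy / sqrt(sxx * syy)


def within_cell_df(cell_sizes):
    """Degrees of freedom summed over cells: sum of (N_j - 2)."""
    return sum(n - 2 for n in cell_sizes)


# Two cells with very different means but the same within-cell relation.
cells = [([1, 2, 3], [2, 4, 6]), ([10, 11, 12], [0, 2, 4])]
r_pooled = pooled_within_cell_r(cells)
```

In the example an ordinary Pearson correlation over all six observations would be pulled down by the difference between cell means, while the pooled within-cell value reflects only the relation inside each cell.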
2. INPUT

2.1 TITLE (col. 1-80): The first card contains any
    identifying information desired.

2.2 PARMS: The second card contains parameters for a given
    run.

2.2.1 NVAR (col. 1-5): The number of variables (may not
      exceed 50).

2.2.2 NCEL (col. 6-10): The number of cells.

2.2.3 INPUT OPTIONS (if all blank, no options are
      desired):

2.2.3.1 Variable labels (col. 12): If 1, variable label card(s)
        precede the cell control card (see below, 2.6).

2.2.3.2 Input unit (col. 13): If not zero or blank, the data
        (cell control cards and observations) are to be read
        from this Fortran unit, instead of from the standard
        input unit (5).

2.2.3.3 Data order (col. 14): If 0 or blank, the data represent
        scores on each of a series of variables for a given
        observation. The series of scores for each observation
        may be punched across one or more cards, but each
        different series must begin on a new card. If 1, the
        data represent repeated observations on a single
        variable. The series of scores for each variable may
        be punched across one or more cards, but each
        different series must begin on a new card.

2.2.3.4 Missing data (col. 15): If 0 or blank, there are no
        missing data, i.e., all zero and blank data items are
        treated as valid. If 1, a missing data descriptor card
        follows the format card(s) (see below, 2.5).

2.2.4 OUTPUT OPTIONS (if all blank, no options are
      desired):

2.2.4.1 Punch format (col. 19): If 0 or blank, any punched
        output will use the program supplied (and SPSS
        regression compatible) format. If 1, two punched
        output format cards follow the input format card (see
        below, 2.4).

2.2.4.2 Punched output (col. 20): If 0, no punched output
        is desired. If 1, means, sigmas, and the pooled
        within-cell correlation matrix are punched. If 2, these
        statistics are punched for each cell. If 3, both cell
        and pooled statistics are punched.

2.2.5 LAG (col. 21-25): The lag may be any plus or minus
      number (less than the number of observations), or it
      may be zero (in which case unlagged correlations are
      computed). Additional lags may be given in col. 26-30,
      col. 31-35, etc., to col. 76-80; thus as many as 12
      different lags may be specified. If rows and columns
      refer to the correlation matrices printed, then a positive
      lag correlates row variables with column variables
      observed earlier, and a negative lag correlates row
      variables with column variables observed at a later time.

2.3 INPUT FORMAT: The third card contains the input
    format (beginning with a left parenthesis in column 1).
    It must specify the appropriate number (NVAR, or
    NOBS if col. 14 of card 2 is 1) of real (i.e., floating point
    or “F”) format items. If necessary, the input format can
    extend over two cards. In that case, col. 80 of the first
    card must contain a dollar sign ($).

2.4 PUNCHED FORMATS (optional, must be present if col.
    19 of card 2 is 1): If the user wishes to override the
    program supplied punched output formats, (8F10.3) and
    (8F10.7), he may do so by supplying his own formats.
    These two formats follow the input format card; the first
    is used for means and sigmas, the second for the
    correlation matrices.

2.5 MISSING DATA DESCRIPTOR (optional, must be
    present if col. 15 of card 2 is 1): Each column of the missing
    data descriptor card indicates whether blanks and/or
    zeros are to be regarded as valid or as missing data for
    the variable corresponding to the column number. If a
    column is 0 or blank, then zeros and blanks are both
    treated as valid data (for that variable). If 1, then both
    zeros and blanks are treated as missing data. If 2, only
    blanks are treated as missing data.

2.6 VARIABLE LABELS (optional, must be present if col.
    12 of card 2 is 1): Labels may be given for each variable;
    they are then used to label printed output. The first label
    is punched in col. 1-8, the second in col. 9-16, etc., to col.
    73-80. Thus 10 labels can be given on one card. NVAR
    labels must be given; as many variable label cards as
    required should be used.

2.7 CELL CONTROL CARD, NOBS (col. 1-5): specifies the
    number of observations for the following cell. A cell
    control card must precede the data cards for each cell. NOBS
    may not exceed 50.

3. Source decks and program listings are available from the
   author.


SETCHEK AND EDICHEK: COMPLEMENTARY
PROGRAMS TO DETECT ERRORS IN PUNCHED DATA
CARDS


CEDRIC G. BULLARD 
University of New South Wales 


Two common sources of error in the computer analysis of large
collections of data in the behavioral sciences are (1) failure to have
precisely the correct number of well-ordered punched cards for each
subject, and (2) having codes punched which do not represent
legitimate codes for the variables under investigation.

If just one card is missing from one set in a collection, then in
the execution phase of any analysis program, the next card will be
read in its stead and each of the following cards will be read in
place of its preceding neighbour until the end of the deck or until
a surplus or misplaced card is encountered. In any case, the error
factor is unpredictable. A similar situation occurs where a surplus
card is encountered (except where it occurs in close proximity to a
missing card and tends to minimise its effect). This may be a card
displaced to a position nearer to the front of the deck, or, as often
happens, a card on which errors were detected during the punching
stage or during the error checking stage and inadvertently left
in the deck together with its replacement card. The situation is
compounded when numerous sets of cards are shuffled out of
order. If the only defect in a deck consists of shuffled sets, then the
error factor is roughly proportional to the number of shuffled sets,
but if shuffling is unchecked, then this error factor is also
unpredictable. Attempted detection of incomplete or shuffled sets by perusing
either the cards themselves or a printed listing is both tedious and
error-prone.

An example of the second source of error would be where a subject
is required to respond to an item on a rating scale on which the
possible responses are assigned integral values of 0-5, but a 9 is
punched in the appropriate data card column. Errors of this type
may result from (a) inaccurate assignment of the response value
by coders, or (b) inaccurate transcription of an assigned value
on to a punched card column. Whilst no computer program can
detect an error where a legitimate code has been punched but is
actually incorrect, many package analysis programs incorporate an
editing option whereby the punched value is tested to see if it lies
within specified minimal and maximal values. However, no error is
detected if the punched value lies between these limits but is not a
legitimate code.

The programs described here are complementary programs
designed to detect errors from the above sources. They may be used
sequentially on a given data collection, thus minimizing the machine
time required to finalize two essentially independent operations.
Program SETCHEK eliminates the error-prone process of visual
scanning for missing, additional or shuffled cards, whilst program
EDICHEK thoroughly edits by allowing not only the customary
setting of lower and upper bounds of legitimate codes, but also the
stipulation of four regions of nonlegitimate codes, an option
sufficiently flexible to cope with most research designs.

Program EDICHEK identifies the respondent when an error
is detected so that the error may be rectified by direct reference to
the original data record (questionnaire, scale, etc.). When an
analysis program is then used, instances of deletion, dumping into
residual categories and inflation of minimum or maximum value
categories are avoided.
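
The two kinds of check can be sketched in Python. This is only an illustration of the ideas described above, not the FORTRAN IV programs themselves; the function names, the representation of a card as a (subject, card number) pair, and the representation of nonlegitimate regions as (start, end) pairs are all assumptions.

```python
from itertools import groupby


def check_sets(cards, cards_per_set):
    """cards: (subject_id, card_number) pairs in deck order.

    Returns the incorrect sets and a tally of subjects with correct sets.
    """
    expected = list(range(1, cards_per_set + 1))
    incorrect, correct = [], 0
    # Consecutive cards bearing the same subject number form one set.
    for subject, group in groupby(cards, key=lambda card: card[0]):
        numbers = [number for _, number in group]
        if numbers == expected:
            correct += 1
        else:
            incorrect.append((subject, numbers))
    return incorrect, correct


def check_field(value, low, high, excluded=()):
    """True if value lies in [low, high] and in no nonlegitimate region."""
    if not low <= value <= high:
        return False
    return not any(start <= value <= end for start, end in excluded)


deck = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 2), (3, 1), (4, 1)]
incorrect, correct = check_sets(deck, cards_per_set=2)
```

In the example deck, subject 3 has a shuffled set and subject 4 an incomplete one, so both are reported while subjects 1 and 2 are tallied as correct. A code of 7 on a scale bounded by 0 and 9 is rejected by `check_field` when the region (6, 8) is declared nonlegitimate, which is the case a simple minimum and maximum test would miss.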


Input 


The Job Detail deck of program SETCHEK consists of a
single card containing, principally, two integers and an I-format
statement for the subject’s identifying number and the sequential
card number.

The Job Detail Deck of program EDICHEK is comprised as
follows:

a. a simple Parameter Card containing three integers,
b. one or more Format Cards describing the subject’s identifying
   number (first or last) and the fields to be edited in I-format,
   and
c. one Field Editing Card for each field to be edited. These are
   simple in format, containing, at most, eleven integers.


Output 


Program SETCHEK produces a warning message for each in- 
correct card set and tallies the number of subjects with correct sets. 

Program EDICHEK produces an Editing Specification Table, a 
warning for each incorrect field for each subject and a tally of the 
subjects processed. 


Limitations 

The programs are written in FORTRAN IV and are limited to
integral data. Program EDICHEK can be easily extended to
handle floating-point data but for general purposes this makes the
preparation of Job Detail Cards a little cumbersome.

Program SETCHEK has the following specific limitations:

  (i)   1 ≤ s ≤ 10,000
  (ii)  2 ≤ c ≤ 99
  (iii) 2 ≤ (s·c) ≤ 20,000

where s is the number of subjects and c is the correct number of
cards per set.

Program EDICHEK has the following specific limitations:

  (i)   1 ≤ c ≤ 15
  (ii)  1 ≤ f ≤ 1,200
  (iii) 0 ≤ n ≤ 999

where c is the number of cards in a subject’s set, f is the number
of fields to be edited, and n is a legitimate code in a given
field.
Availability 


Copies of this paper, operations manuals and source listings can 
be obtained by writing to Cedric Bullard, School of Sociology, 
University of New South Wales, Post Office Box 1, Kensington, 
2033, Australia. 
A CDC-3600 PROGRAM FOR COMPUTATION OF
MAXIMUM PHI WHEN OBSERVATIONS ARE
MISCLASSIFIED*


ROBERT F. BORUCH 
Northwestern University 


The maximum value of the phi coefficient, an index of the relation
between two dichotomous variables i and j, is

$$\phi_{\max} = \sqrt{\frac{p_j q_i}{p_i q_j}}$$

where $1.00 > p_i > p_j > .5$, $q_i = 1.00 - p_i$, and $q_j = 1.00 - p_j$.
When the measurement process is a fallible one, the misclassification
of observations will depress or increase the observed $\phi_{\max}$
(relative to its true value); the change depends on the rates of
misclassification of true positive and of true negative units. It is
usually troublesome to compute all plausible values of true
$\phi_{\max}$ when the misclassification rates are estimated from
empirical data or guessed a priori. In order to facilitate appraisal
of the credibility of the observed $\phi$ and $\phi_{\max}$
coefficients, we have devised a FORTRAN IV program for generating
values of observed $\phi_{\max}$ as a function of true $\phi_{\max}$
and of positive and negative misclassification rates. Computations
are based on the Cochran (1968) model for observed proportions
($p_o$) as a function of true proportions ($p_t$) and of the
likelihoods of false negatives and false positives ($v_n$ and $v_p$,
respectively):

$$p_o = p_t + (1 - p_t)\,v_p - v_n\,p_t$$
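The two formulas above can be combined in a short Python sketch. This is one plausible reading only: the program listing is not reproduced here, so the function names are hypothetical, and the step that applies the misclassification model to each marginal proportion and then recomputes the maximum phi on the observed marginals is an assumption about how the tabled values are generated.

```python
import math


def phi_max(p_i, p_j):
    """Maximum phi for two dichotomies; the larger proportion is treated as p_i."""
    if p_i < p_j:
        p_i, p_j = p_j, p_i
    return math.sqrt((p_j * (1 - p_i)) / (p_i * (1 - p_j)))


def observed_proportion(p_true, v_n, v_p):
    """Observed proportion under the model p_o = p_t + (1 - p_t) v_p - v_n p_t."""
    return p_true + (1 - p_true) * v_p - v_n * p_true


def observed_phi_max(p_i, p_j, v_n, v_p):
    """Assumed reading: maximum phi recomputed on the misclassified marginals."""
    return phi_max(observed_proportion(p_i, v_n, v_p),
                   observed_proportion(p_j, v_n, v_p))
```

With both misclassification rates set to zero the observed and true maxima coincide, which gives a simple check on any table built this way.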
Output 


Tabulated values of maxima for observed $p_i > p_j > .5$, where $p_i$
= .50, .55, ..., .95 and $p_j$ = .50, .55, ..., .95, are provided for
the following conditions.

*Supported by NAS grant #GS23073X.
1. Proportion of false negatives in i and j equal to 0, .050, .075,
   .100, .125, .150 for all i and j;

2. Proportion of false positives in i and j equal to 0, .050, .075,
   .100, .150 for all i and j.

Adjusting the increments in $p_i$, $p_j$, $v_n$, and $v_p$ is optional.

Availability

A listing of the program is available from the author.
REFERENCE 


Cochran, W. G. Errors of measurement in statistics. Technometrics, 
1968, 10, 637-666. 
Oscar K. Buros (Ed.). The Seventh Mental Measurements Yearbook,
Volumes I and II. Highland Park, N. J.: Gryphon
Press, 1972. Pp. xl + 1,986. $55.00 plus postage.


The introductory chapter of The Seventh Mental Measurements 
Yearbook presents again the objectives of the yearbooks. In some- 
what abbreviated form they are: (a) To provide information about 
tests published as separates throughout the English-speaking world; 
(b) to present critical test reviews by test and subject specialists; 
(c) to provide extensive bibliographies on the construction, use, and 
validity of specific tests; (d) to make available the critical portions
of test reviews appearing in professional journals; (e) to list new and 
revised books on testing with evaluative excerpts from reviews of 
these books. Other and “crusading” objectives include: (f) to impel 
test authors and publishers to publish better tests and better data 
relevant to them; and (g-j) to promote greater understanding 
among test reviewers and test users of the values and limitations of 
standardized tests, of methods of appraising them, and of the need 
to be suspicious when adequate data on their construction, validity, 
uses, and limitations are not provided.

Buros contends that success in attaining the crusading objectives
“has been disappointingly modest.” He states on page xxvii:

Test publishers continue to market tests which do not begin to
meet the standards of the rank and file of MMY and journal
reviewers, at least half of the tests currently on the market should
never have been published.

While the above statement may be justified, an analysis of the
reviews in successive Mental Measurements Yearbooks reveals that
tests have improved over the years, as shown by the fact that the
proportion of adverse comments by increasingly sophisticated
reviewers has progressively decreased. This is most true of
achievement and aptitude tests and least true of personality
instruments. An analysis of the yearbooks up to and including The
Fifth Yearbook was reported in 1963 by this reviewer and John M.
Beck. Such an analysis of the content of the MMY reviews should be
repeated to obtain data concerning the characteristics and trends of
criticism.

Also contained in the introductory chapter of The Seventh Yearbook
are brief descriptions of the first six MMY’s, Tests in Print,
and the MMY monographs described toward the end of this review.
Ten tables report the numbers and percentages of new, revised, or
supplemented tests by major classifications, the numbers of original
and excerpted test reviews and the numbers of references in the
MMY’s and other publications in the series.

Especially important in this chapter are the detailed “Suggestions 
to MMY Reviewers” and the instructions on “How to Use This 
Yearbook.” 

More than half of the tests, 55.6 per cent, are new tests listed 
for the first time in The Seventh Yearbook. In The Sixth Yearbook 
51.5 per cent of the tests are listed for the first time. It is interesting
to note the numbers of tests, authors, and reviewers in
English-speaking countries other than the United States; England,
Scotland, Canada, and Australia are examples.

In recent years a number of testing programs have developed 
for the purposes of accreditation of nontraditional study or for 
guidance and placement on the junior college level. These include 
the College-Level Examination Program, CLEP, administered for 
the College Entrance Examination Board by Educational Testing 
Service and the Junior College Placement Program of Science
Research Associates. These programs are among those reviewed in
The Seventh Yearbook, pages 1009-1046.

Another interesting development is the increasing use of high- 
speed electronic scoring machines such as the IBM 1230 and Digitek 
and of scoring services, for example, the services provided by the
Measurement Research Center at the University of Iowa, MRC, and
National Computer Systems, NCS. (See pages xxxix and 997-1007.)
MRC, Testscor, and NCS all provide a scoring service for
the Strong Vocational Interest Blank for Men. The Minnesota
Multiphasic Personality Inventory is scored and computerized
interpretations are provided by several services. (See pages 250-266 and
the examples in the review by Benjamin Kleinmuntz.) 

There are numerous excellent reviews in The Seventh Yearbook. 
Among the most interesting are the reviews of Analysis of Learning 
Potential by Lee J. Cronbach and by Arthur R. Jensen. Other 
examples of reviews of more than usual interest are those of the 
Rorschach by Alvin G. Burstein and Charles C. McArthur. Also
of unusual interest is the review of the Illinois Test of
Psycholinguistic Abilities by John B. Carroll. Many more equally
thought-provoking reviews could be listed.

The publication of the 1938 Yearbook was preceded by three test
bibliographies covering the years 1933-1936. The series of yearbooks
has also included Tests in Print of 1961, which was reviewed
by William B. Michael in the Spring 1964 issue of Educational and
Psychological Measurement; excerpts of this review are reprinted
on page 1428 of The Sixth Yearbook. A similar volume will
appear in 1973.


Reading Tests and Reviews, published in 1968, contains all of the 
reading test reviews and references of the first six mental
measurement yearbooks and lists thirty-three new or revised tests not
included in The Seventh Yearbook. Its review by Frederick B. Davis
appeared in the Autumn 1969 issue of our journal and excerpts are 
reprinted on pages 1593-94 of The Seventh Yearbook. Similarly, 
Personality Tests and Reviews reprints all of the reviews and refer- 
ences of the first six yearbooks and lists eighty new or revised tests 
and 7,116 references not included in The Seventh Yearbook. A 
special review by Fred Damarin of this monograph appeared in the 
Spring 1971 issue of our journal. Excerpts from this review are 
reprinted on pages 1589-90 of The Seventh Yearbook. 

In spite of the monographs described above containing informa- 
tion about new, revised, or supplemented tests since the publication 
of The Sixth Yearbook, but not included in The Seventh, the latter 
is the larger. The number of original test reviews has increased only
from 795 to 798, but the number of references on the construction, use,
and validity of specific tests has increased from 7,967 to 12,372. The
Minnesota Multiphasic Personality Inventory has 16 double-column 
pages of references in The Seventh Yearbook while the Rorschach, 
the Edwards Personal Preference Schedule, and the Thematic
Apperception Test have from 6 to 8½ double-column pages each.

The Wechsler Intelligence Scale for Children and the Wechsler 
Adult Intelligence Scale have about 10 double-column pages each 
of references. Each of the ACT Test Battery of the American
College Test Program, the Cooperative School and College Ability
Tests (SCAT), and the Stanford-Binet Intelligence Scale, Third
Revision, has between 4 and 5 double-column pages of references.
The Strong Vocational Interest Blank for Men and the one for
women have a total of 11 double-column pages of references. Given
a use for such a list of references, it is extremely convenient to have
such lists immediately available in this and earlier MMY’s.

It has been suggested, for the sake of economy, that the lists of 
references and the more than 300 pages of excerpts from reviews of 
current books on measurements should be omitted from future 
mental measurements yearbooks. But such omissions, in the opinion 
of this reviewer, would be a false economy and would greatly lessen 
the impact of the yearbooks on the authors and publishers of tests.
They would greatly decrease the usefulness of the yearbooks in the
selection of tests, particularly for research purposes. The student
and teacher of measurements would be seriously handicapped in
acquiring more than textbook knowledge of the field.

The last pages of Volume II of The Seventh Yearbook contain 
the very useful Periodical Directory and Index, Publishers Directory
and Index, Index of Book Titles, Index of Test Titles, Index of
Names, and, finally, a Classified Index of Tests. This index in
reality is “an expanded table of contents of the tests listed. . . .
Stars indicate tests not previously listed in an MMY; asterisks
indicate tests revised or supplemented since last listed in an MMY. 
Authors of test reviews written for this volume and the number of 
excerpted journal reviews are also presented.” Apart from sum- 
marizing what tests are listed in The Seventh Yearbook, this index 
should make anyone having comprehensive knowledge of educa- 
tional and psychological measurements realize how expertly re- 
viewers of tests have been selected. 

With the reissue in 1972 of The Nineteen Thirty-Eight and The 
Nineteen Forty Mental Measurements Yearbooks, all seven of the 
yearbooks are now in print. While the first two yearbooks have 
historical value, they are also useful in helping to maintain complete 
coverage of tests, test reviews, and bibliographies on specific tests. 
The second of the yearbooks is of interest in its defining of policies
and practices and in its classification of tests, which is similar to
the expanded classification used in more recent volumes. The
Nineteen Forty Yearbook, first published in 1941, reports the
reception given to the reviews in The
1938 Yearbook. While many of the excerpts quoted are compli- 
mentary, others are filled with complaints and excuses of test au- 
thors and test publishers. 

The review of The Fifth Yearbook by Charles R. Langmuir in
the December 1960 issue of Contemporary Psychology, reprinted
in full on pages 1421-1424 of The Sixth Yearbook, is notable
for its discussion of the historical development of The Mental
Measurements Yearbooks and its biographical note concerning their
editor.

It is difficult to find unused superlatives to characterize The
Seventh Yearbook and its predecessors. They are indeed weighty— 
36 pounds on my bathroom scale. I can’t believe I read the whole 
thing, but I have read enough to conclude by saying “Oscar, you 
are incredible!” To use an unused superlative, the Mental Measure- 
ments Yearbooks deserve to be ubiquitous. 


REFERENCE

Max D. Engelhart and John M. Beck. The Improvement of Tests.
The Sixty-Second Yearbook of the National Society for the
Study of Education, Part II. Chicago: University of Chicago
Press, 1963.

Duke University


Herman Burstein. Attribute Sampling: Tables and Explanations. 
New York: McGraw-Hill, 1971. Pp. x + 464. $18.50.


In constructing these tables Burstein has addressed himself to a
problem often met in the behavioral sciences: sampling a binomial
population. The goal of the sampling procedure is to estimate the
proportion of the population which possesses a characteristic that
is either present or absent. For example, what is the proportion of
blue-eyed people in Boston? What proportion of fourth-graders in
the state of California exhibits reading disabilities? How many
mice per hundred in a laboratory colony are prone to audiogenic
seizures?

The usual procedure is to draw a random sample of n observations
from a population of size N and count the members of the
sample which exhibit the characteristic of interest. Let this number
be c; then in a single sampling scheme c/n is taken as the best
estimate of the population proportion.

In many instances, however, such a point estimate is not
enough: the researcher needs to be able to specify with a particular
level of confidence the limits within which the proportion will fall.
Suppose, for example, funds were being allocated to hire reading
specialists to treat dyslectic students in the primary grades and
that each specialist could treat a fixed number of students. The
responsible administrator might decide he needed to be 95 per cent
confident of providing enough specialists to treat all dyslectic
students in the school system. Or, if funds were short, he might elect
to hire sufficient personnel to be 99 per cent confident of being able
to treat at least the minimum number of dyslectic children likely to
appear in the schools. The former example would involve calculating
an upper confidence limit for the proportion of dyslectic students in
the school system, and the latter example would involve calculating
a lower confidence limit. If a compromise had to be effected between
the two alternatives, the problem might well involve a two-sided
confidence interval as well as a point estimate.

Ordinarily, the behavioral or social science researcher would
almost automatically use a normal curve approximation to obtain
his confidence limits. But, for small samples and extreme values of
the population proportion, such approximations could be seriously
in error due to lack of congruence between the binomial and normal
distributions in these circumstances. The Poisson distribution,
which has also been used to approximate binomial values, is less
than satisfactory for small samples or c/n values close to .5. It is
to the solution of such difficulties that Burstein’s first set of tables
is aimed.

The tables “. . . permit determination of binomial confidence
limits for a sample of any size and any value of c/n with a relative
accuracy of at least .999; exact confidence limits are supplied for n
= 20 or less.” Burstein duly notes a few sets of tabular values for
which relative accuracy is less than .999, although the decrement
in accuracy is generally negligible. This degree of accuracy assumes
a very large N: one that is at least 19 times larger than the sample.
For samples which constitute 5 per cent or more of the population,
sets of finite population corrections are supplied. The tables provide
for two-sided confidence limits at the 60%, 80%, 90%, 95%, 98%
and 99% levels; one-sided limits can be calculated for the 80%,
90%, 95%, 97.5%, 99% and 99.5% levels.
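For readers who wish to check a tabled value, exact binomial limits of the kind discussed above can be computed directly. The Python sketch below uses the standard Clopper-Pearson definition solved by bisection; it is not Burstein's tabular procedure, it ignores the finite population correction, and the function names are illustrative only.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _root(f, iterations=100):
    """Bisection on [0, 1] for a function that rises from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_limits(c, n, confidence=0.95):
    """Two-sided exact limits for c occurrences in a sample of n."""
    tail = (1 - confidence) / 2
    # Lower limit: the p at which observing c or more has probability `tail`.
    lower = 0.0 if c == 0 else _root(lambda p: (1 - binom_cdf(c - 1, n, p)) - tail)
    # Upper limit: the p at which observing c or fewer has probability `tail`.
    upper = 1.0 if c == n else _root(lambda p: tail - binom_cdf(c, n, p))
    return lower, upper
```

A one-sided limit at a given level follows from the same equations by placing the whole error probability in a single tail rather than splitting it.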

Burstein’s second and third sets of tables apply to two related 
problems referred to in the book as “proportion sampling” and 
“acceptance sampling,” respectively. In both cases one wants to 
know how large a sample is required in order to keep the error 
with which a binomial population proportion is estimated within
specified bounds.

In proportion sampling one anticipates a value of c/n, chooses a 
confidence level, and then decides upon a tolerable error level (dis- 
tance between c/n and the confidence limit or limits). Given these 
choices, and assuming c/n is as anticipated, the tables allow one to 
arrive at an appropriate sample size for most anticipated proportions 
and any anticipated error margin. Two-sided intervals at the 60%, 
80%, 90%, 95%, 98% and 99% confidence levels may be calculated 
from the tables as well as one-sided intervals at the 80%, 90%, 95%, 
97.5%, 99% and 99.5% levels. Separate tables for 0 < p < .25 and
.25 < p < .50 are provided. The five proportion-sampling tables
vary somewhat in accuracy according to the approximation
procedure upon which a given table is based. Sample size errors will
vary from —25% to +10% with an average error of perhaps +4 
per cent. There are some discontinuities in the tables with respect 
to anticipated sample proportions, but these are not serious. It is 
again assumed that N is very large or at least much larger than the 
sample; if not, a finite population correction is furnished. 
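The logic of proportion sampling can be illustrated with the familiar normal-curve calculation, sketched below in Python. This is the textbook approximation, not the procedure behind Burstein's tables, and it is subject to the inaccuracies for small samples and extreme proportions noted earlier. The function name and arguments are illustrative; the finite population correction shown is the usual one, n0 / (1 + (n0 - 1) / N).

```python
from math import ceil
from statistics import NormalDist


def sample_size(p, error, confidence=0.95, population=None):
    """Approximate n so that a two-sided interval around p has half-width `error`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z * z * p * (1 - p) / error**2
    if population is not None:
        # Usual finite population correction for sampling without replacement.
        n0 = n0 / (1 + (n0 - 1) / population)
    return ceil(n0)
```

For an anticipated proportion of .5, a tolerable error of .05, and 95 per cent confidence this gives 385 cases, or 278 when the population numbers only 1,000.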

Acceptance sampling, in Burstein’s terms, is concerned with the
situation where, for example, a researcher purchases solid state
programming modules in lot quantities. In order to keep operating
standards up and maintenance costs down, he does not want to take
a risk greater than β of accepting a lot with a defective rate greater
than p₂ (I am using Burstein’s notation). But, if this is determined
by sampling methods, the researcher takes the additional risk α
of rejecting a lot with satisfactory rate p₁. The situations in which
only p₂ and β, or only p₁ and α are of importance are referred to as
“quasi-acceptance sampling.”

Use of the tables presupposes that the lot will be 10 to 20 times
larger than the sample; otherwise a finite population correction is
used. It is also assumed that p₁ and p₂ are both .25 or less, or that
the single value is .25 or less in the quasi form. A separate table is
provided for the special case where p equals zero. Jointly, the tables
allow calculation of sample sizes for α and β risks of 0.5%, 1%,
2.5%, 5%, 10% and 20% and for all combinations of these risk
levels in the case of acceptance sampling. The author states that
the tables provide values which “. . . result in substantially accurate
sample size . . .”.
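The acceptance sampling problem can be made concrete with a brute-force search, sketched here in Python. It assumes a lot large enough for binomial sampling to apply and a simple rule of accepting the lot when the sample contains no more than a fixed number of defectives; the function names, the acceptance-number rule, and the search limit are illustrative assumptions and do not reproduce Burstein's tables or formulas.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X distributed Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def acceptance_plan(p1, p2, alpha, beta, max_n=2000):
    """Smallest (n, c) such that accepting when defectives <= c keeps both risks.

    p1: satisfactory defective rate, rejected with probability at most alpha
    p2: unsatisfactory defective rate, accepted with probability at most beta
    """
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            accept_bad = binom_cdf(c, n, p2)
            if accept_bad > beta:
                break  # a larger c would only raise the risk of accepting a bad lot
            reject_good = 1 - binom_cdf(c, n, p1)
            if reject_good <= alpha:
                return n, c
    return None  # no plan found within the search limit
```

The quasi-acceptance case, in which only one rate and one risk matter, corresponds to checking just one of the two conditions.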

It should be made clear that one does not look up confidence limits
or sample sizes directly in Burstein’s tables. The tabular entries are
rough approximations which, together with sample size, anticipated
proportion(s), risk level, etc. as the case may be, are entered into
formulas that then provide very much better approximations to
required values. The virtue of this approach, as opposed to what
would be a truly staggering listing of particular values, is a
maximum amount of information with a minimum number of pages. The
computation required of the user is modest, and the many worked 
examples in each section should clarify any problem that would 
ordinarily arise. 

Several questions need to be asked in evaluating Burstein’s work.
The most obvious has to do with the accuracy of the tables: are
the values correct, at least within the limits claimed by the author?
An exhaustive answer to this question would entail research nearly
as formidable as Burstein’s own. To answer this query for myself at
a much more modest level of confidence, I employed three
approaches. First I scanned the issues of the Journal of the American
Statistical Association since 1968 (when Anderson and Burstein’s
approach was published) for procedures or references which
promised consistently greater accuracy than Burstein’s. There appeared
to be none. My second approach was to compare selected values
from Burstein’s tables with those in works such as Cooke, Lee, and
Vanderbeck’s “Binomial Reliability Table (Lower Confidence
Limits for the Binomial Distribution).” Burstein’s values
compared favorably in each case that I examined. Finally, I computed
a few values directly for half a dozen simple cases. Again,
Burstein’s values appeared to be completely satisfactory. These
admittedly modest and unsystematic approaches to assessing the
accuracy of Burstein’s work nevertheless reassured me.

A second evaluative dimension upon which the tables need to
be weighed relates to their organization and the effort needed to use
them. The former is, quite simply, admirable; and, as indicated
above, the latter is negligible in comparison with the accuracy
provided.

Finally one must ask whether there may not be more efficient ways
to get at the information the procurement of which Burstein’s efforts
are intended to facilitate. That is, Burstein presupposes a single
sampling scheme. But there are many, perhaps a majority, of cases
in which multiple or sequential sampling procedures might provide
better information at a lower cost. To the extent that the
availability of these tables, accurate, well-organized, and easy-to-use as
they are, discourages the utilization of more efficient procedures,
Burstein’s efforts must be viewed as antiproductive. This remark is,
however, intended as a comment and a warning to the uncritical
user rather than an adverse reflection on this excellent and useful
work.


James A. WALSH 
Iowa State University 


W. Grant Dahlstrom, George Schlager Welsh, and Leona E. 
Dahlstrom. An MMPI Handbook—Volume I, Clinical Inter- 
pretation. (Rev. ed.) Minneapolis: The University of Minnesota 
Press, 1972. Pp. xxvi + 507. $18.75. 


This is Volume I of a revised Handbook; Volume II is scheduled
to appear in 1973. Expansion to two volumes has been deemed 
necessary because of the number of studies published during the 
decade since the first edition appeared. This volume is concerned 
with material pertinent to interpretation of individual MMPI 
records, while the second will be concerned with the use of the test 
to select subjects for a study or to evaluate the effects of treatment 
or some other manipulation. The authors are eminently qualified for 
their role, having been associated in research and practice with the 
test since their graduate school days at Minnesota in the late '40's. 

Volume I of the Handbook is divided into three parts: ad- 
ministration, scoring, and categorizing profiles; interpreting the 
special scales which are designed to measure the patient’s ability 
and intent to cooperate; and interpreting the profile of scores, 
particularly scores on the nine original scales. There are, in addition, 
15 appendices which list norms for various scales, an index of the 
items by key words, rules for profile discrimination and frequencies 
of patterns identified by the two highest scales. Finally, there is a 
bibliography of about 600 items with the promise of a more complete 
list of MMPI references in Volume II. 

The Minnesota Multiphasic Personality Inventory was developed
by S. R. Hathaway and J. C. McKinley in the late 1930’s and
published as the Minnesota Multiphasic Personality Schedule in
1943. To the validating scales and nine clinical scales was added
Drake’s Si scale, a measure of social introversion. In the original
work and subsequently, heavy reliance has been placed on empirical
item selection, i.e., collecting the answers of a group of people who
belong in a given diagnostic category and identifying the items
which this group answers differently than a sample of normal
subjects. This strategy has proven to be quite productive. With
cross-validation to eliminate items that differ due to spurious factors,
scales derived by this method appear to be able to categorize new
subjects from similar populations with considerable accuracy.
Whether a scale derived by these methods will work with different
populations and whether it measures anything of general
significance are matters for later research. These issues must also be the
central concern of a handbook on the clinical interpretation of this
test.

In the Foreword to the first edition, reproduced here, Hathaway
offers the rather startling opinion that, while he and McKinley
hoped that with five years’ experience they might increase the test’s
validity, he now doubted that it would be possible to improve
the MMPI appreciably, even to hold the test’s current validity.
In this spirit the original nine clinical scales have been preserved,
with an accumulation of interpretive lore.

In Part I the authors take no fewer than 100 pages to present
administration, scoring, and construction and coding of the profile,
with a brief introduction to the strategy of grouping by profile type.
Patients are categorized, by and large, by their two highest clinical
scales, more or less regardless of elevation as long as they are above
a T-score of 70, or two standard deviations above the mean of the
original Minnesota normal sample.

Part II is concerned with evaluation of the scores on the four
scales designed to measure some of the distortions which might
be introduced by the subject’s willingness and ability to respond
accurately to the items of the test.

The third part of the Handbook is concerned with a description of
each of the nine scales plus Drake’s Si scale, the characteristics of
patients associated with various profiles grouped by the two highest
scales and, finally, a description of computerized interpretation
with an extended presentation of one case. In their scale-by-scale
discussion the authors devote very little attention to the specific
nature of the clinical group used in the development of each scale
but a good deal of attention to personality differences which a
number of investigators have found associated with high and low scores
of normal subjects on various scales. There are many apparent
contradictions in the adjectives which raters apply to those who are
high on a given scale. One is led to ask whether an investigator
interested in personality differences would not be well advised to
use a personality test designed and validated for that purpose, such
as the personality tests developed by Gough and by Jackson.

The heart of the matter of clinical interpretation, included in
Chapter 7, consists of a discussion of the behavior noted among
patients within groups defined by the two highest scales of their
MMPI profile. Numbering the clinical scales as they appear on the
profile, the user identifies a profile by a code of two digits. Thus, a
profile with Depression highest and Hysteria second is coded 23.
With 10 scales, and with the possibility that only one is elevated,
there are 100 profile code groups. With sex and age differences
likely, it is apparent that a population of several thousand patients
would be necessary to develop stable data on each profile group.
The matter is complicated further by the fact that certain scales
such as Pa, Sc, and Ma are less frequently elevated than D and Hy.
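The two-digit coding and the count of 100 code groups can be verified with a few lines of Python. This is a simplified sketch, not the Handbook's full coding rules: it assumes the conventional scale digits (1 to 9 for the clinical scales in profile order, 0 for Si), treats a scale as elevated only above a T-score of 70, and breaks ties by profile order. The function names are illustrative.

```python
# Conventional digits: 1-9 for the clinical scales in profile order, 0 for Si.
SCALES = ["Hs", "D", "Hy", "Pd", "Mf", "Pa", "Pt", "Sc", "Ma", "Si"]
DIGIT = {name: str((i + 1) % 10) for i, name in enumerate(SCALES)}


def profile_code(t_scores, cutoff=70):
    """Code a profile by its one or two highest scales above the cutoff."""
    elevated = [s for s in SCALES if t_scores.get(s, 0) > cutoff]
    elevated.sort(key=lambda s: t_scores[s], reverse=True)  # ties keep profile order
    return "".join(DIGIT[s] for s in elevated[:2])


def count_code_groups(n_scales=10):
    """Ordered pairs of distinct scales plus the single-scale codes."""
    return n_scales * (n_scales - 1) + n_scales
```

With 10 scales there are 90 ordered pairs and 10 single-scale codes, which is the figure of 100 groups given above; dividing even a few thousand patients among them, and then by sex and age, leaves few cases per cell.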

The final chapter, an example of profile interpretations by a 
clinician and by a number of computer programs, gives some 
sense of what can be said on the basis of an MMPI. It is by no 
means clear, however, how the statements were derived—certainly 
not from the material as reported in the preceding chapter. Because 
the patient in question terminated his contact before his evaluation 
was complete, we have no way of assessing the accuracy of the 
interpretations. Even if he had stayed and entered therapy it would 
still be difficult to evaluate how much additional and valid infor- 
mation was contained in the MMPI interpretations that was 
not available from his intake interview. A further question, which 
may be dealt with in Volume II, is the matter of how much infor- 
mation is available from the MMPI which makes any difference in 
how the patient is treated. 

It is difficult to review a book about a line of research without 
evaluating the research as well. The book itself is well written, with 
only as much specific vocabulary as is necessary. The most 
serious shortcoming, in the opinion of this reviewer, is the lack 
of evaluative and summarizing statements. One finding after another 
is presented without any attempt to give an overall picture. Nor are 
there evaluations of many of the studies cited. Surely, there were 
weaknesses in design or sampling; surely some findings have been 
confirmed by other investigators. In short, it reads too much like a
well arranged annotated bibliography and not enough like a
handbook.

The sample of interpretations in the last chapter indicates that 
the authors intend that the test be used to develop personality 
evaluations and not just diagnostic statements. Interpretive state- 
ments, however, require some sort of model, explicit or otherwise, 
of personality processes so that one can go beyond merely tabulat- 
ing adjectives which may be used to characterize an individual’s 
behavior. The empiricist approach of MMPI advocates tends to 
minimize theories with the result that interpretations become 
speculative descriptions without implications for treatment. With- 
out some sort of guiding theory one has no basis to choose which 
aspects of behavior to measure. Tradition alone seems to dictate
the continued measurement of hypochondriacal concerns,
depressive trends, etc. Possibly Volume II will indicate whether there is
any reason to cling to the original nine scales other than the fact 
that users have developed an impressive body of statements as- 
sociated with each scale, statements of unknown accuracy and 
utility. 

GEORGE M. GUTHRIE 
The Pennsylvania State University 
Robert L. Ebel. Essentials of Educational Measurement. (2d ed.)
Englewood Cliffs, N. J.: Prentice-Hall, 1972. Pp. xiv + 622. 
$9.95. 


This comprehensive text in educational measurement gives 
“more attention to practical problems and procedures than to the- 
oretical formulations and issues.” After chapters devoted to the 
history of educational measurement and its functions in the process 
of education, Ebel discusses cognitive outcomes as objectives. The 
following items quoted from Ebel’s summary of chapter 3 are in- 
dicative of his educational philosophy: 

8. A major goal of education is to develop in the student a com- 
mand of substantive knowledge. 

9. A person’s knowledge consists of everything that he has ex- 
perienced as a result of his perceptions of external stimuli or 
internal thought processes. 

10. Knowledge is a structure built out of information by processes 
of thought. 

11. Verbal knowledge is a very powerful, uniquely human form 
of knowledge. 

13. Command of knowledge is demonstrated by its use in problem 
solving, decision making, explanation, argumentation, and pre- 
dictions. 

20. The school should seek to attain affective ends by cognitive
means.

In Chapter 4, Ebel explains the important limitations of
criterion-referenced measurement (much to the delight of this
reviewer). Chapter 5 contains a series of multiple-choice items
illustrating measurement of various specific objectives: understanding
of terminology, knowledge of fact and principle, ability to explain
and illustrate, ability to calculate, ability to predict, ability to
recommend appropriate action, and ability to make an evaluative
judgment. In Chapter 6 there is a comprehensive comparison of
objective and essay tests and, although it is evident that Ebel prefers
the former, he does offer useful suggestions for preparing and
scoring essay tests.

Chapters 7 and 8 deal respectively with true-false and
multiple-choice items. Ebel discusses at some length the criticisms of
the former: triviality, ambiguity, guessing, and harmful effect on
learning. He then proceeds to their defense and to an explanation of
how to write effective true-false items. He almost convinces this
reviewer that his own prejudice against such items is not very rational.
But while directing the construction and use of comprehensive
examinations in the various branches of the Chicago City Junior
College, we tallied the item-test correlations of hundreds of items
classified according to numbers of alternatives. These coefficients
averaged much lower for the true-false items than for items with
three, four, or five alternatives. After a time, no more true-false
items were accepted for use in the comprehensive examinations. It
may be true that such items need not average so low in discriminating
power, but we can’t expect teachers in general to write effective
ones.

Chapter 8 on the writing of multiple-choice items is unequaled
by other discussions of multiple-choice items. Especially noteworthy
are the sample items on pages 193-195 and the later
numerous contrasts between desirable and undesirable items. From 
each example the reader learns just what characteristic renders 
the items effective or ineffective. 

Ebel occasionally and briefly mentions objective items of types 
other than true-false and multiple-choice. He does not regard 
key-list or matching items with much enthusiasm, nor does he
advocate the use of items calling for the interpretation of
quoted material. Since 1942, this reviewer has several times
explained and illustrated such exercises. The number of items relevant
to quoted material should be relatively small because of
the time required for reading the material. 

Ebel mentions the usefulness of evaluation of items by a colleague
and explains how item analysis can promote the improvement
of items. This reviewer wishes that the values of cooperative
efforts in defining instructional objectives, in writing test material,
in scoring tests, and in assigning marks had been given greater 
and more explicit emphasis. 

Chapter 9 deals admirably with such important testing problems
as test anxiety, cheating, rapid hand or machine scoring, and
correction for guessing. Chapter 10 is an excellent discussion of
oral examinations. Such discussions are too seldom found in
measurement texts.

In Chapter 11, the treatment of test score statistics is appropriately
introductory, but deals adequately with percentile ranks, both
linear and normalized stanines, and product moment and rank difference
correlation coefficients.

Chapter 12 is a thorough discussion of marks and marking 
systems. Ebel describes the numerous controversies that have ex- 
isted with reference to marks and discusses practical procedures in 
assigning them. It is evident that he favors relative measures of 
status of achievement, letter marks such that the five letters are 
in terms of equal intervals along a scale which reflects the average 
level and variation in ability of a given class. 


1 See, for example, “Improving Classroom Testing,” No. 31 of the series
What Research Says to the Teacher. Washington: National Education
Association, 1964.

Chapters 13 and 14 explain how to judge the quality of a classroom
test and how to improve it by means of item analysis.
Chapter 15 explains how to estimate, interpret, and improve test 
reliability. While the calculation and meaning of reliability co- 
efficients and standard errors of measurement are well explained, 
this reviewer misses the very lucid derivation of the Kuder-Richard- 
son formulas of the first edition of this text. Chapter 16 is an 
equally competent and comprehensive discussion of test validity. 

Chapters 17-21 are the concluding chapters of this book. They 
deal with standardized achievement tests and test batteries; in- 
telligence and aptitude tests; and tests of personality, attitudes, 
and interests. One of these chapters explains standard scores and 
suggests how to determine a passing score. The appendices con- 
tain a useful glossary of measurement terms, projects and prob- 
lems, and a bibliography. 

This reviewer is generally enthusiastic about this book, but a 
number of critical comments can be offered. The chapter summaries 
are very commendable, but there should also be questions for 
class discussion and possibly chapter bibliographies. In the history 
of educational measurement some mention should be made of the 
very influential Progressive Eight Year Study and the Cooperative 
Study of General Education. Something should have been said 
about the contributions to educational measurement made by 
Walter S. Monroe and his initiating the Encyclopedias of Educational
Research. This reviewer's attitudes toward true-false items
and toward the neglect of items requiring interpretation of quoted
materials have already been mentioned. In using this text as the basis
for a course in educational measurement, the instructor will need to
define his objectives, since no definition of instructional objectives
is given. In spite of these comments, Essentials of Educational
Measurement deserves to be judged one of the two or three best
books on educational measurement and, in spite of its advocacy of 
true-false items, the best book on classroom testing. 

Max D. ENGELHART 
Duke University 


Herbert Ginsburg. The Myth of the Deprived Child: Poor Child- 
ren’s Intellect and Education. Englewood Cliffs, N.J.: Prentice- 
Hall, 1972. Pp. xvi + 252. $6.95 and $3.95 (paperback). 


The sources of social-class differences in ability and achievement
test performance, the disastrous state of urban schools, and the
validity of new alternatives in education have received their share
of attention recently both in the popular press and in scientific
literature. While personal opinions abound, empirical data are
both sparse and contradictory. Experimental psychology has yet
to make truly significant contributions to educational practice.
One therefore expects very little that is new or valuable from a 
book purporting to analyze “poor children’s intellect and educa- 
tion.” This book defies all such expectations. 

The author’s stated purpose is to examine psychological research 
on the nature and development of poor children’s mental abilities, 
and then to evaluate the psychological assumptions underlying 
recent attempts to improve the education of the poor. 

The introductory chapter is a short and effective reminder of 
the major concern: poor children often fail in school. It is also a con- 
cise statement of the issues involved in attempting to understand 
the reasons for poor children’s academic failure. The critical ques- 
tions are: What is the nature of poor children’s intellect?, What is 
the course of development of their cognitive abilities?, and What 
forms of education are most conducive to adequate cognitive 
development? 

The first issue dealt with in detail is the nature of IQ tests and 
their relevance for understanding poor children’s mental abilities. 
Poorer children consistently score lower on IQ tests than more
privileged children, but “what is the ‘intelligence’ that IQ tests
measure (p. 26)?” Several detailed examples of the administration
of IQ test items are presented, along with an excellent analysis of
the numerous cognitive abilities required by each item. It is argued
that IQ tests (a) measure a number of different mental abilities,
(b) place strong emphasis on verbal skills while at the same time
being uncorrelated with standardized measures of creativity (it is
surprising to find Ginsburg taking “tests” of creativity seriously!),
and thus cannot be considered to measure fundamental mental
abilities, and (c) do not necessarily measure basic competence, owing to
the inappropriateness of standardized testing conditions for eliciting
optimal performance (the argument would be greatly strengthened
by reference to Cole, Glick, and Sharp's (1971) cross-cultural
work). The author further contends that, while the nature-nurture
question is “non-sensible”, it is clear that environmental events
can drastically affect an individual’s IQ.

These considerations lead to the conclusion that social-class
differences in IQ (10-20 points) are relatively small and fail to
tell us much about the nature of poor children’s mental abilities.
It is also argued that the high correlation between IQ and academic
achievement does not necessarily mean that fundamental intellectual
abilities are required by both, but rather that both “emphasize
verbal skills, mental drudgery, and a certain docility of
character (p. 57).” This is the kind of position for which it is
difficult to muster experimental evidence, but which is even more
difficult to deny for anyone taking a reasonable look at what
both IQ tests and school achievement require.

Having discarded IQ tests as uninformative on poor children's
mental abilities, Ginsburg next examines the “enlightened” view
that poor children live in a deprived linguistic environment and
that their deficient speech leads to deficient thought. He discusses
in some detail Bernstein's well-known distinction between “restricted
code” (predominantly in the lower class) and “elaborated
code” (predominantly in the middle class). In order to evaluate
the validity of educational programs based upon this analysis
(such as Bereiter and Engelmann), normative studies of language
are critically analyzed. Heavy emphasis is given to the work of
Labov which demonstrates the complexity of black dialect and the
sophistication of poor black children's knowledge of it. Ginsburg
concludes that poor children do not have a language deficit and thus
that educational programs based on such an assumption are
inappropriate.
If poor children are not deficient in language, maybe they lack
basic conceptual abilities. In order to evaluate educational programs
based on this assumption (such as Klaus and Gray), Ginsburg
explains the basics of Piaget’s theoretical view of the nature
of cognitive development. He also examines cross-cultural studies
related to Piaget’s framework and concludes that the results
demonstrate certain universals in the course of cognitive development,
with only the rate of development differing from one culture
to another. While this conclusion seems well supported by the data,
one may still legitimately argue that cultural differences in the age
at which Piagetian stages are reached are evidence for the importance
of socio-cultural factors in determining mental development.
No convincing response has been offered here to the position that
rate of development is both modifiable and significant.
Other evidence concerning the conceptual abilities of poor children
comes from empirical studies comparing children of different
socioeconomic classes on a number of different cognitive measures. Some
of the best known work in this area (Palmer, Deutsch, Lesser) is
critically analyzed. Ginsburg's analysis is objective and his
conclusions are qualified, but in general he feels that these studies do
not demonstrate any important or meaningful social-class differences
in mental abilities. This evaluation is supported by reanalyses
of the Lesser data and methodological difficulties in studies
of patterns of abilities (Feldman, in press).

Having concluded that poor children are not deficient in intelligence,
language, or modes of thought, the author next examines
how the cognitive skills of the poor child develop. A common approach
recently has been to assume that the mother plays a significant
role in retarding the poor child’s intellectual growth. From a
critical examination of such well-known studies as Hess and
Shipman’s, Ginsburg concludes that social-class differences in
maternal behavior are usually small and nonsignificant and are
not known to bear any significant relation to cognitive develop- 
ment. In addition, he offers some detailed observations of his own 
children which are interesting examples of the child as the active 
initiator of learning and which indicate that the parent cannot 
succeed in deliberately directing the child’s learning process to any 
significant degree. While the argument is convincing, it cannot be 
interpreted as demonstrating that the parent does not have any 
important effect (albeit unknowingly) on the child’s cognitive 
development. In this regard, it is unfortunate that Ginsburg 
was not able to include recent investigations of the possible im- 
portance of contingency and style of maternal responsiveness 
for early cognitive functioning (e.g., Lewis and Goldberg, 1969;
К and Wilson, 1972; Watson, 1966; Watson and Ramey, in
press).

Another well-publicized theory of poor children's intellectual
development is Jensen’s distinction between “associative learning” 
(more typical of poor children) and “conceptual learning” (more 
typical of middle-class children). This is severely criticized on the 
basis of the not very convincing argument that there are few 
significant correlations among different learning tasks for both 
lower- and middle-class children. The argument would be made 
much stronger by the recent work of Rohwer which seriously 
questions Jensen’s formulation (Green and Rohwer, 1971; Rohwer, 
1971). The author also reports interesting results from his own 
study of the development of children’s printing in an open class- 
room. He concludes that “under certain environmental conditions, 
namely the open classroom, poor children as a group display the 
same motivational tendencies and cognitive abilities as do middle- 
class children. This statement must, of course, be qualified by 
reference to the age of the child and the subject matter involved 
(p. 180).” Any special strengths or weaknesses that poor children
show are the result of adaptation to a distinctive
environment, and their skills may sometimes be inappropriate
for what the traditional school asks of them.

The final chapter reviews the psychological assumptions underlying
three different approaches to educational reform: (a) compensatory
education, (b) improvements in traditional forms of
education, and (c) open classrooms. Ginsburg’s conclusion is that
the open school, while still underdeveloped and imperfect, is
“what we need for all children.” At the same time, he recognizes
the present inadequacies in this approach, including the difficulty 
of finding methods to assess the child's knowledge which are 
suitable to the open classroom. Indeed, this is one of the major chal- 
lenges to educational psychology today. 

Ginsburg has deliberately been very selective in his analysis and
the book is definitely better for it. Several instances of reliance on
empirical results which have been severely criticized (Rosenthal,
criticized by Snow, 1969) or which are very incompletely reported
(Robinson) are unfortunate. But in general the author
has done a remarkable job of putting well-known empirical studies
into perspective. His incisive approach is reflected in the statement:
“statistical significance is not psychological significance (p. 144).”
He takes a hard, critical look at how studies are carried out and
interpreted, regardless of what position they support. At the
same time, he does not pretend to be entirely unbiased: “the idea
of science as an objective machine leading inexorably to the truth
is fallacious (p. 107).” Ginsburg’s own framework for analysis
and interpretation is clearly and admittedly Piaget’s theory of
cognitive development.

The intended audience is “anyone interested in poor children” and
the author has done a superb job of writing an analysis that is
both understandable to nonpsychologists and fascinating to the
experienced researcher. The inclusion of case studies and journalistic
reports adds much to clarifying and strengthening his arguments,
but does not compensate for the dearth of sound experimental work
in this area.

While Ginsburg’s main argument is that any social-class differences
(in IQ score, sensorimotor development, etc.) are small
and reflect only quantitative and/or content differences in the
same fundamental cognitive processes, he has not convincingly
countered all protest on this point. Many are quite willing to
accept that the basic cognitive processes of different social classes
do not differ, while still maintaining that quantitative differences
in IQ represent significant differences with important implications.
After all, the very definition of an IQ test score implies that
individual differences in rate of growth are important. While the
demonstration that different responses to just a few items can
result in quite different IQ scores is certainly pertinent, it is doubtful
that IQ supporters will be converted by this argument. Nevertheless,
one cannot go unaffected by Ginsburg’s refreshing emphasis on
analyzing similarities in cognitive development. Present approaches
to the psychology of cognitive growth, and in particular the
psychometrics of mental abilities, are fixated upon individual
differences in performance. But it would seem just as important to
examine similarities in cognitive development; Piaget has certainly
proven the value and productivity of such an approach. And we
would learn a great deal if we investigated what different experimental
manipulations can produce similar performance by all
individuals.

This book is an outstanding contribution to the application of
psychological research to educational practice. One need not agree
with all the conclusions to appreciate the insightful analyses and
syntheses of theory, research, and practice.


Homer H. Johnson and Robert L. Solso. An Introduction to Experimental
Design in Psychology: A Case Approach. New York:
Harper and Row, 1971. Pp. viii and 216. $2.95 (paperback).


The authors’ stated purpose for writing this book was the “feeling 
that the basic principles of experimental design can be taught rather 
easily by using fairly simple examples of research.” In essence, they 
use subject matter content as the vehicle for presenting methodol- 
ogy. The alternative approach attempts to present both methods 
and subject matter, and all too frequently methods become obscured 
in the detail and specific procedures of a particular subject matter. 
The student is unable to see the forest for the trees. Johnson and
Solso have succeeded in presenting a text that is readable,
understandable, and organized in such a way that the student receives a
clear presentation of experimental principles.


The book itself is divided into a section which presents the principles
of design and a second section which consists of a series of
experiment reprints illustrative of the design and control procedures 
covered in the first section. After presenting a brief introduction to 
the scientific method in Chapter 1, the authors discuss in Chapters 
2 and 3 design strategies and techniques of controlling extraneous 
variables. Chapter 4 presents a series of experiment summaries
which the student is to critique and which presumably will allow
him to check his progress and understanding of the material presented
up to that point. Chapter 5 covers techniques for controlling
subject variables, and another series of experiment summaries is
presented for critique in Chapter 6. Chapters 7 through 14 contain
the reprinted experiments, each of which is followed by a detailed
“case analysis” of the experimental report and a series of questions
to test the student's understanding of the analysis.

Since over half the book is devoted to reprinted experiments,
comprehensiveness is consequently reduced. Topics such as hypothesis
formulation, the experimenter effect, counterbalancing, and levels of
significance are either not mentioned at all or are mentioned in
insufficient detail. For instance, level of significance is briefly explained
as a reflection of the probability that the results would be obtained
by chance alone. An experimental text should make clear that
“chance” is actually variation due to uncontrolled factors (e.g.,
extraneous variables, subject variables, and experimental error)
rather than some nonidentifiable cause or some intrinsic variability
in behavioral data itself. Their discussion of
“blinds” also seems to be at odds with traditional treatments of the
topic. They use the term single blind to refer to the situation where
the researchers who are judging performance do not know which
treatment the subjects have been assigned to, and double blind
where both judges and subjects are uninformed. Single blind has
traditionally been used to refer to the case where the subject does
not know which treatment condition he is being exposed to.
Also, the book could have been improved if answers to the design
summaries for critiquing in Chapters 4 and 6 had been presented.
Since the book is designed as a supplementary text, it seems reasonable
to assume that most students will use it as a source for
self-study. Therefore, feedback would facilitate the student's assessment
of his progress and understanding.

The final point involves the use of the book itself. While it is
probable that different instructors can profitably use the text for a
variety of courses, if used in conjunction with a main text in
experimental psychology there will probably be a great deal of overlap
which may limit its usefulness. It may well find most frequent
use by instructors who are experimentally oriented and want to augment
their courses or as a refresher text for upper level
undergraduates.


В. ALAN Harrop 
East Carolina University 


Pauline S. Sears (Ed.) Intellectual Development. New York: John
Wiley and Sons, 1971. Pp. xviii + 579. $12.50.


Intellectual Development is the first of a projected series of seven
reviews of educational research to be prepared by the American
Educational Research Association. The purposes of the series,
according to the editor, are “to promote a systematic development of
the quickly growing field of educational research” and “to make
available to students, teachers, researchers and administrators a
comprehensive, useful and organized set of outstanding published
papers. . . .” Further, the series is to “build understanding across
different areas and specialties of educational research and of
preventing insularity among educators and educational researchers.”

These few matter-of-fact statements serve to obscure the awesome
ambitiousness of such a project. Indeed, a little reflection on
the above goals leads me to question the wisdom of such a project.
In a time when journals themselves are coming to serve more and
more as archives, not reflections of current status, is such a project
as outlined above a viable and useful one? An examination of the
first volume may assist in reaching an answer to this question.

Intellectual Development contains 32 articles divided into four
approximately equal sections entitled The Development of
Intelligence, Conceptual Processes, Problem Solving Strategies, and The
Development of Language. Here one must ask, “Can any such
set of articles contribute significantly to the ambitious goals set
forth for the series?”

If one momentarily holds aside the question of whether the book
meets the goals of the entire series and considers the volume alone,
one may ask how it fares as a book of readings. For this one needs 
at least a short summary of the contents. 

The section on the Development of Intelligence contains a set of
almost uniformly excellent articles by such authors as Nancy Bayley,
K. Warner Schaie, Herman Witkin, Jerome Kagan and Lawrence
Kohlberg, although not all of them deal directly with developmental
processes and one, Guilford’s “Intellect Has Three Faces,”
does not touch upon developmental issues at all. Several articles
contain explicit methodological pitfalls in an attempt to make the 
reader aware that the study of intelligence itself has many faces. 

The section on Conceptual Processes begins with several articles
written from a “behavioristic” point of view, continues with a
“transitional” article by Huttenlocher from a “cognitive” point of view,
and then devotes the remaining articles to studies testing Piaget’s
notions concerning the development of concrete and formal operations.
The section also contains Bruner’s general formulation of the
course of cognitive growth. The words in quotations in the above
paragraph reflect my confusion over the use of those words in the
context of the book.

Problem Solving Strategies, to me, differs from the preceding section
on conceptual processes largely on the basis that the words
“problem solving” appear in most of the titles. Articles deal with
probability learning, learning sets, and Simon, Newell and Shaw's
classic on the human being as an information processing system.

The concluding section, The Development of Language, is rather 
mislabeled since it deals not so much with the development of lan- 
guage as with the effects of language on thinking, learning and be- 
havior in general. 

It is difficult, at one level, to be critical of the inclusion of any 
particular article in the volume. This is so because of the criteria 
which the editors used to select the papers. It is important to know 
these criteria and how they were used since such an understanding 
provides much insight both into the difficulty in constructing the
volume and into the volume itself. In abbreviated form the criteria
are:

1. The reading is clearly outstanding and likely to be influential
for future work.

2. The reading can stand alone.

3. Sound methodology and analyses were performed.

4. The reading is interesting, clear, and comprehensible to first
year graduate students.

5. The reading suggests educational implications and sheds light
on process.

6. The readings as a whole provide some diversity in content and
theoretical approach.

7. The reading has not been widely reproduced in other books.

8. Each paper presents evidence on developmental phenomena.
“Any particular article however need not meet all criteria, only if it
fails on some it is considered very important on others.” With the
exception of criterion seven, which seems justifiable only in terms
of economic expediency and not the goals of the volume, one cannot
reasonably quibble with any criterion. But given that the articles
must only meet some of the criteria, it should be immediately
obvious to the reader why it is impossible to criticize any selection,
although one can question the use of such an airtight system of
selection.

Omissions, however, are another matter. Any student in educational
research is going to encounter, sooner or later and probably sooner,
the work of Arthur Jensen and the controversy surrounding that
work. Intellectual Development makes a passing reference to this
research in a rather odd context and does not include any of the
research itself. This hardly seems justifiable.

Many of the articles test or discuss Piaget’s conceptions of cogni- 
tive growth. Yet there is no word from Piaget himself. This ab- 
sence is all the more peculiar when it is noted that one selection is 
Bruner’s reinterpretation of Piaget of which Piaget is highly critical, 
and two selections are from Bruner, Oliver and Greenfield (1966). 
The Course of Cognitive Growth, a book specifically repudiated by 
Piaget, is not representative of his thought. Given these facts, the 
inclusion of Piaget’s review of the Course of Cognitive Growth 
(Piaget, 1967) seems more than warranted. 

The biases in the selection on language development are more 
pronounced. Of the six articles published since the appearance in 
1960 of Bernstein’s “Language and Social Class,” one is by Bern- 
stein and two others rely on this article for their hypotheses con- 
cerning learning in Black children. It seems wholly inappropriate 
that the reader should not be advised that these studies have been 
labeled “racist” in some quarters, and, more to the point, erroneous
in many others. As a matter of balance an article such as that of
Labov (1970) should have been included.

At a less topical level there is the matter of theoretical bias. It 
is clear in the selections that a bias exists toward making language 
a determinant of cognitive development. Again, in a volume which 
deals heavily in Piagetian theory, it is surprising that the reader is 
not advised that Piaget explicitly refutes this viewpoint. One must 
dig deeply into Kohlberg’s article on early education to find a hint 
of Piaget launching his counter thrusts. And what more challenging 
article has appeared lately (and been partially replicated) concern- 
ing the relation of language and cognitive growth than Piaget’s 
study on the development of memory (Piaget, 1968) ? 

There are, in addition, omissions of certain areas. One can, I believe,
make a forceful argument that the development of reading
skills is as much a part of intellectual development as the develop- 
ment of oral language. One can make an even more forceful argu- 
ment concerning creativity. By omitting any concern with the area 
of creativity, the volume is telling teachers and administrators that 
creativity is outside of the realm of the “intellect” (whatever that
is). I do not think such a message is a healthy one. (The objection
that inclusion of such areas would make the volume unmanageably
large says more about the practicality of the undertaking than the
validity of the criticism.)

I am not particularly worried about the “message” delivered by
the volume, however; it is my belief that the series cannot achieve
its goals. For this volume in particular, the goal of reaching teachers
and administrators will not be met. Criterion number four states
that the selections must be comprehensible to first year graduate
students. If one asks, “First year graduate students in what?”, the
answer is immediately clear: psychology or educational psychology.
For this reason the volume is more likely to perpetuate than eliminate
insularity.

There is yet another level at which the volume fails to meet the 
goals set for itself. One of its goals is to be forward looking and 
reference is made to a statement by Lee Cronbach that we are
training students for a status quo which no longer exists. But the volume
does not truly take into account what is happening to that status 
quo. For most children in the United States, development takes 
place in a social context which includes a nuclear family up to age 
5 or 6 at which time school is superimposed for the next 12 years. 
The volume takes no real account of this context, thereby perpetuating
the myth that there is a “thing” called intellect, nor of the fact
that this context is under attack as never before. Critiques of the
schools are legion. Yet the volume assumes the continuance of a
school system with its “hidden curriculum,” in the separate senses
that both Jackson (1968) and Illich (1971) have used that phrase.
It also assumes the preservation of the nuclear family despite growing
attack on this institution (Laing and Esterson, 1964; Cooper,
1970; Slater, 1970; Firestone, 1970; Greer, 1971).

Unless the series in some way takes into account this changing
context, which does not seem evident from this first volume, the series
will stand largely as a monument to the insularity of the American 
Educational Research Association from the rest of the world. 
CONCERNING THE MEAN OF THE CENTRAL
F DISTRIBUTION


JULIAN C. STANLEY¹
In his note, Ramseyer (1972) proved that the mean of the central
F distribution (i.e., when the null hypothesis is true) is greater
than 1. He did not indicate what the mean actually is. It can be
shown (e.g., see Hald, 1952, pp. 374-375) that

\[ E(F) = \frac{f_2}{f_2 - 2}, \qquad f_2 > 2, \]

where f₁ and f₂ are the degrees of freedom for the chi-square variables
in the numerator and the denominator, respectively. This is
greater than 1 except when f₂ is infinite; then it becomes 1.

The modal F is

\[ \frac{f_2 (f_1 - 2)}{f_1 (f_2 + 2)}, \qquad f_1 > 2. \]

This can take values 0 (when f₁ = 2) through 1 (when both f₁ and f₂ are
infinite). See Hald (1952, p. 375) for distribution curves of F for the
following df: 10, ∞; 10, 50; 10, 10; and 10, 4. Also see Glass and
Stanley (1970, pp. 233-234) for distributions when df are 4, 4 and 4, 25.
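
The two closed forms above are easy to tabulate. The following is a minimal sketch in Python, not part of the original note; the function names `f_mean` and `f_mode` are hypothetical, and infinite degrees of freedom are handled as the limiting cases described in the text.

```python
import math


def f_mean(f2):
    """Mean of the central F distribution, f2 / (f2 - 2); needs f2 > 2."""
    if f2 <= 2:
        raise ValueError("the mean exists only for f2 > 2")
    if math.isinf(f2):
        return 1.0  # limiting value as the denominator df grow without bound
    return f2 / (f2 - 2)


def f_mode(f1, f2):
    """Mode of the central F distribution, f2 (f1 - 2) / [f1 (f2 + 2)]."""
    if f1 < 2:
        raise ValueError("this expression applies for f1 >= 2")
    numerator_part = 1.0 if math.isinf(f1) else (f1 - 2) / f1
    denominator_part = 1.0 if math.isinf(f2) else f2 / (f2 + 2)
    return numerator_part * denominator_part


# The df pairs for which Hald (1952, p. 375) is cited in the text.
for f1, f2 in [(10, math.inf), (10, 50), (10, 10), (10, 4)]:
    print(f1, f2, round(f_mean(f2), 4), round(f_mode(f1, f2), 4))
```

With 10 and 4 degrees of freedom, for instance, the mean is 2 while the mode is only about .53, which illustrates how strongly skewed the distribution is when the denominator df are few.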

In a rather different, applied context Stanley (1957) showed that

\[ \frac{E(X)}{E(Y)} = E\!\left(\frac{X}{Y}\right) \]

if σ_{Y(X/Y)} = 0, i.e., if the denominator variable covaries zero with
the ratio. For example, because of the way that the Stanford-Binet
Intelligence Scale was constructed this tends to occur in an unselected
population when X is mental age, Y is chronological age, and 100
(X/Y) is IQ. (See McNemar, 1942.) Then 100[E(MA)/E(CA)] =
E(IQ), because there E(MA) = E(CA), and E(IQ) = 100. In practice,
however, σ_{CA,IQ} within school classes is usually negative, so the
ratio of means tends to underestimate the mean IQ.

1 I am indebted to the following persons for helpful comments concerning
this topic: Acheson J. Duncan, Gene V Glass, Leon J. Gleser, Bert F.
Green, Samuel A. Livingston, Andrew C. Porter, and Gary C. Ramseyer.
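
The relation between the index of means and the mean of indices follows from the identity E(X) = E(Y)E(X/Y) + cov(Y, X/Y). The sketch below, which is not from the original note, checks that identity numerically; the function name `index_of_means_check` and the small data set are hypothetical and chosen only so that the denominator covaries negatively with the ratio.

```python
from statistics import fmean


def index_of_means_check(x, y):
    """Return (ratio of means, mean of ratios, cov(Y, X/Y)) for paired data.

    The covariance uses the population form (divide by n), so the identity
    mean(X) = mean(Y) * mean(X/Y) + cov(Y, X/Y) holds exactly for the sample.
    """
    ratios = [a / b for a, b in zip(x, y)]
    mean_x, mean_y, mean_ratio = fmean(x), fmean(y), fmean(ratios)
    cov = fmean([(b - mean_y) * (r - mean_ratio) for b, r in zip(y, ratios)])
    return mean_x / mean_y, mean_ratio, cov


# Hypothetical ages in months: older children have the lower ratios here.
chronological = [100.0, 120.0, 140.0]
mental = [120.0, 120.0, 112.0]

ratio_of_means, mean_of_ratios, cov = index_of_means_check(mental, chronological)
print(ratio_of_means, mean_of_ratios, cov)
```

Because the covariance is negative in this illustration, the ratio of means falls below the mean of the ratios, which is the direction of bias described for IQ within school classes.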


REFERENCES 


Glass, G. V and Stanley, J. C. Statistical methods in education
and psychology. Englewood Cliffs, N. J.: Prentice-Hall, 1970.
Hald, A. Statistical theory with engineering applications. New
York: Wiley, 1952.
McNemar, Q. The revision of the Stanford-Binet scale. Boston:
Houghton Mifflin, 1942.
Ramseyer, G. C. A note on the expectation of the F-ratio. EDUCATIONAL
AND PSYCHOLOGICAL MEASUREMENT, 1972, 32, 111-115.
Stanley, J. C. Index of means vs. mean of indices. American
Journal of Psychology, 1957, 70, 467-468.




THE LACK OF RETEST RELIABILITY FOR
INDIVIDUAL DIFFERENCES IN THE
KINESTHETIC AFTEREFFECT


ARLENE H. MORGAN AND ERNEST R. HILGARD¹
Stanford University


RECENT interest in the kinesthetic aftereffect (KAE) has taken
two directions. First is a more precise quantification of the KAE
following the procedures of Köhler and Dinnerstein (1947), who
demonstrated the KAE by measuring changes in judgments of
width (Bakan and Thompson, 1962, 1967; Hilgard, Morgan and
Prytulak, 1968). Second is a classification of individuals on a
perceptual style dimension based on individual differences in the KAE
response. Petrie (1967) proposed a theory of augmentation-reduction,
which states that some people “augment” incoming
stimuli and others “reduce” them. These characteristic styles can
be identified, according to Petrie, by the KAE response. Augmenters
exaggerate the KAE when the standard block feels wider
(after rubbing a smaller block); reducers exaggerate the KAE
when the standard block feels narrower (after rubbing a larger
block). Petrie relates this tendency to a number of personality
variables, the most important of which (in her theory) is pain
tolerance. She proposed that people who tolerate pain well are those
who reduce incoming stimuli, and that they can be selected by
their extreme response in the KAE when the stimulus block is
larger than the test block. Poser (1960) and Ryan and Foster
(1967), supporting her theory, reported that reducers, as selected
by their KAE response to the “reduction” contrast, tolerated pain
better than nonreducers. Other studies, while not strictly comparable
in method, also report a relationship between the KAE and
pain tolerance (e.g., Sweeney, 1966; Dinnerstein, Lowenthal,
Marion, and Olivo, 1962).

1 This ... in the Laboratory of Hypnosis Research un... from the
National Institute of Mental ...

A recent attempt to replicate Petrie’s findings was unsuccessful 
(Morgan, Lezard, Prytulak, and Hilgard, 1970). Not only did that 
study fail to show a relationship of the KAE with pain tolerance 
and other personality measures, but the classification of “augmen- 
ters” and “reducers” from the KAE responses was not clear. The 
correlation between the amount of change under “augmentation” 
conditions and under “reduction” conditions on the appropriate 
days was —.34, indicating that those who augmented the most 
(positive score) tended also to reduce more (negative score) under 
that contrast. While such a tendency is contrary to Petrie’s formu- 
lation (the exaggeration of both augmentation and reduction, 
called “stimulus-governed” by her, being supposedly rather infre- 
quent), it is intuitively understandable. That is, individuals who 
are “contrast sensitive” will respond to the contrast in either di- 
rection. The reply to the question of what is so is more an empirical 
than a theoretical one, and, in any case, depends on the reliability 
of the KAE measure. 

Petrie (1967) reported a split-half reliability coefficient of .97 
for the KAE measure. However, this correlation is so high because
it was based on scores derived from two highly correlated measures, 
each subtracted from a common baseline. The KAE was measured 
in three measurement periods, each of which consists of four trials 
made within a few seconds’ time; Petrie’s reliability estimate is 
based on the sum of all Trials 1 and 4 versus the sum of Trials 2 
and 3, all subtracted from the common baseline. It is to be expected 
that width estimates of a standard block, taken at the same time 
under the same conditions, will be highly correlated. Thus this kind 
of split-half reliability tells us nothing about the reliability of an
individual’s tendency to “reduce” or “augment” from one testing to 
another, as a perceptual style. Petrie reports the correlation be- 
tween mean reduction and mean augmentation (with a minimum 
interval of 48 hours) to be .77 for a sample of 28 adults. The differ- 
ence between this significant positive correlation, and the nonsig- 
nificant one of —.34 reported by Morgan and others (1970) has 
not been explained. 
Other reliabilities have been reported by Eysenck (1955) to
range from .78 to .94, although these correlations are high because
they were based on two reduction measures taken within the same
session, and computed from the same baseline. Spitz and
Lipman (1960) report retest reliability of .74 for the reduction
measure, although the test-retest interval was no longer than 20 minutes.
The present study was designed to measure the retest reliability
of the KAE on different days, comparing the amount of reduction
on each day from that day’s own initial baseline.


Method 


Subjects were 40 male university students who were invited to
participate in three one-hour testing sessions at $2.00 per hour.
The procedure, explained in detail in Petrie (1967), was as follows:

The subject was blindfolded and his right hand placed on a rectangular
("standard") block two inches (50.8 mm.) wide. His left
hand was placed on a tapered block 30 inches long, varying in
width from 1/2 inch at the lower end to 4 inches at the upper end.
The tapered block is equipped with a finger guide which moves
along a ruler from which the width estimates are recorded. The subject
was asked to “find a spot on the tapered block that feels just as
wide” as the block in his other hand. When he reported having
found it, the scale reading was recorded, the subject’s hand was
returned to the end of the scale, and he was again asked to “find a
spot that feels just as wide. ...”²

When the baseline estimate had been obtained, the subject's
hands were removed from the blocks; the right hand was placed on
a 2 1/2-inch (63.5 mm.) stimulating block, while the left hand
rested. The subject was then asked to rub the stimulating block,
after which his hands were returned to the tapered and standard
blocks, and the Measurement I estimates were recorded. Two more
measurement periods were recorded the same way.

2 ... subject’s hand to the lower end of the scale ... Lowenthal (1966)
reported that estimates made from the lower end were consistently lower
than estimates made from the upper end. The present study recorded the
estimates alternately from the lower and upper ends of the tapered block.

In the present study, a baseline estimate only was obtained on
Day 1, and no contrast stimulation was given. One week later, a
baseline was again established and reduction stimulation given. 
On the third day, the Day 2 procedure was repeated. Thus Days 1 
and 2 provide a retest of the baseline estimate, which is critical in 
the measurement of reduction. Days 2 and 3 provide a retest of the 
reduction measure itself. 

Each experimental hour was filled by administering tests de- 
signed to measure other cognitive styles (leveling-sharpening, re- 
pression-sensitization, field dependence-independence) for com- 
parison with the augmentation-reduction dimension. These results 
were negative, and will be reported elsewhere (Morgan, in press). 


Results and Discussion 


The mean absolute change in width estimate on Day 2 was 4 mm. 
(s.d. = 3.4); on Day 3, it was approximately 3 mm. (s.d. = 3.0). 
Table 1 gives correlations between width estimates in the various 
measurement periods on three days. 

These correlations reflect a tendency for subjects to make quite 
consistent width judgments from one day to the next. Within days 
(boxed areas), the correlations are extremely high. The high corre- 
lations between Baseline and Measurement periods further reflect 
the tendency of all subjects to reduce their width estimates following 
contrast stimulation; if, in fact, some individuals “reduced” from 
their baseline significantly more than others, the reduction stimula- 
tion should produce a shifting of individuals in the distribution of 
post-stimulation scores, and the Baseline-Measurement correlations 
would be low. Thus, while these high correlations confirm the lawful- 
ness of the kinesthetic aftereffect as a psychophysical phenomenon, 
they do not validate individual differences in the reduction tendency. 

We can predict a low reliability for the KAE (change) score on
the basis of these high correlations alone, using McNemar's (1969)
formula for predicting the reliability of difference scores:

\[ r_{dd} = \frac{r_{xx} + r_{yy} - 2r_{xy}}{2 - 2r_{xy}}, \]

where r_dd is the reliability of the difference scores, r_xx the
two-day reliability of the baseline scores, and r_yy the two-day
reliability of the post-stimulation score (Measurement III).

Then:

r_xx = .70 (the correlation between the Day 2 and Day 3
baselines);

r_yy = .79 (the correlation between the Day 2 and Day 3
Measurement III periods);

and r_xy = .80 or .87 (the correlation between the baseline and
Measurement III on Day 2 and Day 3, respectively).

Using the Day 2 r_xy of .80, the predicted reliability of the
differences is as follows:

\[ r_{dd} = \frac{.70 + .79 - 2(.80)}{2 - 2(.80)}
          = \frac{1.49 - 1.60}{2 - 1.60}
          = \frac{-.11}{.40} = -.28 \]

Because of the high correlation between baseline and
post-stimulation measurement (r_xy), McNemar’s formula predicts a
slight negative correlation between difference scores (KAE) on the 
two days, hence no reliability at all. Our results are consonant 
with this prediction, as seen in Table 2. 
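
The arithmetic above can be reproduced with a few lines of Python. This is only an illustrative sketch, not part of the original analysis; the function name `difference_score_reliability` is hypothetical, and the formula is taken in the form given in the text, which treats the two measures as having equal variances.

```python
def difference_score_reliability(r_xx, r_yy, r_xy):
    """Predicted reliability of difference scores.

    r_xx, r_yy: reliabilities of the two component measures.
    r_xy: correlation between the two measures (must be below 1).
    """
    if r_xy >= 1.0:
        raise ValueError("r_xy must be less than 1")
    return (r_xx + r_yy - 2.0 * r_xy) / (2.0 - 2.0 * r_xy)


# Values reported in the text: baseline and Measurement III reliabilities,
# with the Day 2 and Day 3 baseline-measurement correlations in turn.
for r_xy in (0.80, 0.87):
    print(r_xy, round(difference_score_reliability(0.70, 0.79, r_xy), 2))
```

With the Day 3 correlation of .87 the predicted value is even more strongly negative than with the Day 2 value, so the conclusion does not depend on which day's correlation is used.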
To state the argument another way, post-stimulation responses 
ought not be highly correlated with their baselines because the ex- 
perimental treatment is supposed to cause a differential reduction 
from baseline according to individual differences. The
intercorrelations lead inexorably to a low retest reliability for reducing
scores.



TABLE 2
Correlations of the KAE Reduction Measure on Two Days
(N = 40)

KAE score, Day 2 vs. Day 3        r
Meas. I                          .09
Meas. II                         .15
Meas. III                        .10
Mean, all Meas.                  .18

The contradiction between this finding and the high reliabilities
reported by others, however, remains unexplained. In an attempt
to explain it, the present data were examined for residual aftereffect.
Bakan and Thompson (1962) define residual KAE as reduction
which persists from one testing to another, lowering the
baseline on the retest day. They found a significant correlation between
the amount of reduction on the first testing and the decrease
in baseline estimate on the retest after one week and again after
one month. That correlation in the present study turned out to be
.48, indicating that, in fact, the greater the reduction on Day
2, the greater the residual carried over to influence the baseline on
Day 3. If the amount of reduction on each of the two days is computed
from the initial uncontaminated baseline, the two-day reduction
reliability becomes .59. This approaches the range of the
correlations reported in other studies in which the two reduction
measures are computed from a common baseline.*

We might well conclude that the KAE phenomenon is too complex
to be very useful in the study of individual differences. It has
validity, as evidenced by the high correlations between pre- and
post-contrast stimulation. As a mean effect, it is reliable, in
that it consistently occurs, and does so with sufficient reality to
leave an effect, from one testing to another. However, the use of the
KAE as a descriptive personality variable is strongly discouraged
by the results of the present study.
THE RELIABILITY OF DIFFERENCES BETWEEN
LINEAR REGRESSION WEIGHTS IN APPLIED 
DIFFERENTIAL PSYCHOLOGY 


FRANK L. SCHMIDT 
Michigan State University 


In many areas of psychology, education, sociology, economics and 
other behavioral and social sciences, a relatively common research 
design is one that calls for the prediction of the standing of a person 
or thing on one variable, often designated the criterion, from his or 
its standing on a number of other variables, often called the predictors. 
When the relationships in question are linear, least squared error 
multiple regression weights are most commonly used in weighting 
the predictors into a composite. These weights minimize the sum of 
the squared deviations of the observed from the predicted criterion 
scores (Anderson, 1958). In practice, the sample regression weights
(β̂) are often computed on relatively small samples, and as a result,
are only rough approximations to the population regression weights
(β), which are, by definition, the most effective set of predictor
weights possible. If applied to the entire population, β̂ would produce
some correlation, ρ(β̂), and β would produce ρ(β), the maximum
correlation. Although ρ(β̂) will vary depending on the chance
differences between different β̂, ρ(β) is a parameter and thus has only
one value. A previous study (Schmidt, 1971) showed that, for certain
combinations of N (sample size) and p (number of predictors) in
applied differential psychology, simple unit weighting of predictors
(summing of z scores of predictors) produces, on the average, a
larger correlation in the long run, i.e., in the population, than β̂.
With small N and large p, these differences in predictive efficiency
favoring simple unit predictor weights (1) over β̂ were large enough,
in some cases, to be of practical significance in applied situations
(e.g., .12-.13 correlation units).
Obviously, these results obtained because β̂, as a function of
random sampling fluctuation, contained much error.¹ That is,
differences between elements in the β̂ were a reflection of sampling
error as well as of actual differences between elements in the
corresponding β. Error variance associated with a given regression weight
is usually conceptualized as the variance of the distribution that
would result if, holding N and p constant, an infinite number of
estimates of a given regression weight were computed, each from a
new sample. This is the conceptual model underlying the formula
for the standard error of the regression weight. However, another
error model may be of more explanatory value here: a simple
variance components model inspired by classical test theory.

If we assume that sampling error is independent of such parameters
as β and the population validity vector, then the total variance
within estimates of these vector parameters can be viewed as the
sum of the population variance of the parameters and error variance
due to sampling. Suppose, for example, that the variance across a set
of population regression weights, β, is .0520 and the variance of a
sample estimate of these weights, β̂, is .0640. Then, using the present
model, variance due to sampling error is .0640 minus .0520, or .0120.
Total obtained variance (.0640 here) equals true variance (.0520 here)
plus error variance (.0120 here).

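In equation form:

\[ \sigma^2_{tot} = \sigma^2_{true} + \sigma^2_{e}, \tag{1} \]

where

\[ \sigma^2_{tot} = \sigma^2(\hat{\beta}), \qquad
   \sigma^2_{true} = \sigma^2(\beta), \qquad
   \sigma^2_{e} = \sigma^2 \text{ due to sampling error.} \]
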


Using this model, one can compute the reliability of the difference
between elements in an estimate of a population parameter vector.²
For example, the reliability of differences between elements within



1 Sampling fluctuations were the only source of error in the β̂. The data used
fit the linear multivariate normal model and there was no post hoc selection of
p′ predictors (where p′ < p).

2 This model is, of course, applicable only in instances in which parameters
are vectors rather than scalar values. Scalar value parameters obviously have
no population variances, but when the parameter of interest is a vector, we
may speak of the variance in the population across the values in the vector.

a given set of sample regression weights, β̂, would be [σ²(β)/σ²(β̂)].
This ratio corresponds to the basic definition of reliability as the
ratio of true to total variance or the proportion of total variance that
is true variance. Of course, σ²(β) is usually not known in practice,
and so for most purposes of practical reliability estimation, this
model is not of much value. However, a Monte Carlo approach to
this problem developed by the writer calls for specification of the
“population” correlation matrix (Σ) by the investigator. When
Σ is known, β, and thus σ²(β), can be computed directly. β̂ and
σ²(β̂) are then computed from sample correlation matrices (R)
generated from Σ. For any N and Σ, generation of a number of R
allows the single estimate of σ²(β̂) to be replaced by the average of
a number of such estimates, designated εσ²(β̂). The present study
had two purposes. The first was to examine systematically, via the
above ratio, the average reliabilities of differences between regression
weights in the data domain of applied differential psychology.
Multiple regression techniques are used almost routinely to weight 
predictors differentially; often obtained weights are interpreted 
directly, with relative size being used as an index of the theoretical 
or practical significance of variables. The question of the reliability 
of the differences between these weights is therefore an important 
one. The second purpose of this study was to ascertain magnitudes 
of these reliabilities necessary under various combinations of N and p 
for regression weights to equal and to exceed simple unit predictor 
weights in predictive efficiency. To the present author’s knowledge, 
the literature to date contains no studies addressed to these two 
problems. 
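
The variance components model and the reliability ratio described above can be illustrated with the numerical example given earlier (.0520 and .0640). The short Python sketch below is an illustration only; the function name `variance_components` is hypothetical and does not come from the original study.

```python
def variance_components(true_variance, total_variance):
    """Split total variance into true and error parts and return the
    reliability, i.e., the proportion of total variance that is true variance.
    """
    if total_variance <= 0 or true_variance < 0:
        raise ValueError("variances must be non-negative, with total > 0")
    error_variance = total_variance - true_variance
    reliability = true_variance / total_variance
    return error_variance, reliability


# Variance across population weights = .0520; across sample weights = .0640.
error_variance, reliability = variance_components(0.0520, 0.0640)
print(round(error_variance, 4), round(reliability, 4))
```

For the example in the text, the error variance is .0120 and about 81 percent of the variance among the sample weights would be true variance.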


Method 


Sampling Plan for Σ Matrices

In order to insure that the Σ used were similar to those actually
existing in the data domain of interest, it seemed appropriate to
employ R matrices from the data domain as estimates of Σ. For
all practical purposes, the R are unbiased estimates of their
respective Σ. Four journals (Educational and Psychological Measurement
(EPM), Journal of Applied Psychology (JAP), Journal of Educational
Psychology (JEP), and Personnel Psychology (PP)) were selected as
representing the data domain of applied differential psychology, and
the years 1959-1969 were selected for examination. For two of the
journals (EPM and JEP) the odd years were sampled and for the
other two, the even years. In both cases, correlation matrices of odd
numbered dimensions from 3 × 3 to 11 × 11 were recorded. In
some cases, only parts of larger matrices were used. Correlation
vectors containing negative or zero values were not used as validity
vectors, and it was sometimes necessary to rearrange rows and
columns in order to meet this condition. An attempt was made to
keep all validities above .20, a value chosen as approximately the
minimum that would ordinarily be used in practice. For each matrix
size, a random sample of 10 matrices was drawn from the pool of
recorded matrices of that size and used as estimates of the Σ.
For certain of the larger matrices, a sample of 10 could not be obtained
using only the journal volumes designated in the sampling plan, and
additional samples had to be taken from the previously unused
volumes. Even so, only eight 11 × 11 matrices could be found and
two had to be taken from another source (Wechsler, 1949, p. 10).
Obviously, the sampling fraction was much larger for the large
than for the small matrices.


The Program 


Sample correlation matrices (R) computed from randomly drawn
samples from a multivariate normal distribution are distributed as
W(N, p, Σ), the Wishart distribution. For each given N and Σ
combination, 100 R matrices were generated from this distribution
and 100 β̂ were computed, which, in turn, yielded 100 σ²(β̂) values.³
The average of these 100 values of σ²(β̂) was taken as the estimate
of εσ²(β̂) for a specific Σ at a given N level and σ²(β)/εσ²(β̂) was
taken as the estimate⁴ of the average reliability of differences between
regression weights for that Σ, given N. These reliability estimates
were then averaged across the Σ to provide an estimate of mean
reliability in the data domain as a whole at given N and p values.
Values of N and p lying within the ranges most frequently encountered
in practice were chosen. Ns investigated were 25, 50, 75, 100, 150,
200, 500, and 1000. Values of p used were 2, 4, 6, 8, and 10.

3 This program was written by Dr. Vernon Urry, now at the University of
Washington. The process of R-generation works by means of the Bartlett
decomposition of the Wishart distribution (Bartlett, 1933; Kshirsagar, 1959;
Wijsman, 1957). This approach to sampling from N(μ, Σ) has been discussed
and employed by Browne (1968) and Herzberg (1969), both of whom have
carried out tests of the simulated data showing that the required assumptions
are met. In addition, Herzberg (1969) showed that the results from simulated
data were almost identical with the results from a large sample of empirical
data.

4 It is well known that, strictly speaking, [k/ε(X)] ≠ ε(k/X) (where k =
a constant and X is a random variable), but, as Browne (1969) has shown, the
discrepancy is so small as to be negligible for psychometric purposes.
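
The logic of the procedure just described can be sketched in a few lines of Python for the simplest case of two predictors. This is a rough illustration under stated assumptions, not the program used in the study: it draws raw trivariate normal cases and computes R from them rather than using the Bartlett decomposition of the Wishart distribution, it relies on the closed-form standardized weights for two predictors, and the population correlations (.3, .5, .2) and all function names are hypothetical.

```python
import math
import random
from statistics import fmean, pvariance


def betas_two_predictors(r12, r1y, r2y):
    """Standardized least squares weights for two predictors (closed form)."""
    denom = 1.0 - r12 ** 2
    return [(r1y - r12 * r2y) / denom, (r2y - r12 * r1y) / denom]


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a positive definite matrix."""
    size = len(matrix)
    lower = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def pearson(x, y):
    """Product-moment correlation of two equal-length lists."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sample_correlations(lower, n, rng):
    """Draw n cases (x1, x2, y) and return the sample r12, r1y, r2y."""
    columns = ([], [], [])
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        for i, column in enumerate(columns):
            column.append(sum(lower[i][k] * z[k] for k in range(i + 1)))
    x1, x2, y = columns
    return pearson(x1, x2), pearson(x1, y), pearson(x2, y)


def reliability_of_differences(r12, r1y, r2y, n, reps=100, seed=0):
    """Ratio of the variance across population weights to the average
    variance across sample weights over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    lower = cholesky([[1.0, r12, r1y], [r12, 1.0, r2y], [r1y, r2y, 1.0]])
    true_var = pvariance(betas_two_predictors(r12, r1y, r2y))
    sample_vars = [
        pvariance(betas_two_predictors(*sample_correlations(lower, n, rng)))
        for _ in range(reps)
    ]
    return true_var / fmean(sample_vars)


for n in (25, 100, 1000):
    print(n, round(reliability_of_differences(0.3, 0.5, 0.2, n), 3))
```

The variance across the weights is taken here in its population form (dividing by p); because the same form is used in numerator and denominator, that choice does not affect the ratio. With only 100 replications the estimate at large N can drift slightly above 1 by chance.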


Results and Discussion 


Table 1 presents the average reliability of differences between
regression weights for all N and p values for all matrices. It can be
seen that differences between regression weights do not, in general,
attain levels of reliability generally considered satisfactory for most
psychometric instruments until N of 500 or more are reached. With
one exception, all reliabilities for sample sizes of 75 or under are
... or less, even when p is small. Reliabilities in the .20’s and .30’s
are much in evidence, and some are even lower than .20. The last
column in Table 1 shows the average reliabilities at each N level across
p values. In the last row of Table 1, where N = ∞, εσ²(β̂) becomes
σ²(β), the reliability ratio becomes 1.000, indicating that differences
between regression weights are perfectly reliable when the "sample"
used is the entire population.

An examination of the β for each of the 50 Σ revealed that 31
had one or more suppressor variables. Since suppressor variables
are rarely used in applied differential psychology (Adkins,
1947), it is probably hazardous to generalize to this data domain
from the present matrix sample. Therefore, Table 2, showing
the average reliabilities of differences between regression weights
for those matrices without suppressors, was computed. In the bottom
row of Table 2 may be seen the number of matrices remaining
at each p value after those containing suppressors were removed.
Again, the mean reliabilities across all p values are given in the
last column of the table.


TABLE 1

Average Reliabilities of Differences between Regression Weights across All Matrices
of Sample Regression Weights for All Combinations of N and p

                             p                          Means
                                                        across
   N        2        4        6        8       10       all p

   25      ...     .289     .2513    .1931    .1131     .2305
   50      ...      ...     .3946    .3609    .2632     .3912
   75     .5262    .5074    .4701    .4443    .3604     .4616
  100     .6183    .5558    .5376    .5294    .4338     .5350
  150     .6110    .6398    .6196    .6300    .5355     .6071
  200     .6771    .6919    .6549    .6756    .6045     .6608
  500     .7724    .8453    .7998    .8374    .7184     .8066
 1000     .8055    .8977    .8624    .9329    .8812     .8759
    ∞    1.0000   1.0000   1.0000   1.0000   1.0000    1.0000


Comparing Tables 1 and 2, one can easily see that the average
reliabilities are consistently higher when matrices containing
suppressors are included. This results from the fact that the existence
of suppressors leads to larger differences between elements in β,
which, in turn, leads to larger values of σ²(β). Since error variance is
independent of σ²(β) and depends only on N and p, the effect is to
increase the average reliability ratio [σ²(β)/εσ²(β̂)] at all N and p
values. Thus differences between computed regression weights will
tend to be more reliable when one is dealing with a population which
contains suppressor variables. Reliabilities that would generally be
considered adequate for most psychometric instruments for most
uses are, in general, achieved for differences between the regression
weights for the matrices without suppressors at N levels somewhere
between 500 and 1000.⁵ For all sample sizes below 75, irrespective
of p, these reliabilities are below .50. At smaller sample sizes,
reliabilities below .20, and even below .10, are much in evidence. Since
sample sizes used in research in most areas of applied psychology
tend to be relatively small (Lawshe and Schucker, 1959), it is probably
reasonable to conclude from these data that the reliabilities of
differences between most regression weights reported in the literature
are between .10 and .60. Implications for the practice of directly
interpreting the relative magnitudes of regression weights computed
on small samples are obvious. In addition, it should be pointed out
that the reliability estimates in both Table 1 and Table 2 are probably
overestimates of the actual reliabilities in their respective data
domains. The use of sample matrices from the literature as estimates
of the Σ, the population matrices, tends to inflate values of σ²(β),
thus inflating the reliability ratio [σ²(β)/εσ²(β̂)].

The explanation for this inflation of σ²(β) values is relatively
straightforward. In Σ matrices, the greater the variance of
predictor intercorrelations and validities, other things equal, the more
the individual regression weights in β will differ from each other.
Because of the addition of variance due to sampling error, the
variance of predictor intercorrelations and validities can be
expected to be higher in sample matrices taken from the literature
than in their corresponding population matrices. The result⁶ is
larger values of σ²(β).


TABLE 2
Average Reliabilities of Differences between Regression Weights of Sample Regression
Weights for All Combinations of N and p for Those Matrices
without Suppressors

                             p                              Means
                                                            across
   N        2        4        6         8        10         all p

   25     .2489    .0904    .19037    .09575    .0755       .1402
   50     .4351    .1629    .32835    .23200    .1877       .2692
   75     .4580    .2436    .44017    .30745    .2987       .3495
  100     .5597    .2761    .52550    .35375    .3576       .4145
  150     .5489    .3805    .60657    .46830    .4537       .4916
  200     .6303    .4654    .67015    .55700    .5505       .5746
  500     .7089    .7079    .84157    .72820    .777        .7508
 1000     .7663    .7837    .89750    .88595    .8596       .8385
    ∞    1.0000   1.0000   1.0000    1.0000    1.0000      1.0000
No. of
Matrices    8        4        4         2         1          19


5 With a large sample of Σ at each p value, it would be expected that
mean reliability would consistently decrease as p increased within each
sample size, since the addition of each predictor leads to a loss of one
degree of freedom. In fact, this pattern does obtain in Table 1 up to and
including a sample size of 100. But as N increases from 100 to ∞, error
variance due to sampling (σ²_e) becomes less important relative to σ²(β) in
determining mean reliability at each p-value. Due to sampling fluctuations
in selecting ...

For correlation matrices in applied differential psychology
without suppressors, regression weights begin to be superior to unit
weights, on the average (across p values), when the sample size is
about 100 (Schmidt, 1971). The average reliability of differences
between regression weights at this sample size can be seen in Table
2 to be .4145. When suppressors are allowed, regression weights
begin, on the average (across p values), to be superior to unit
weights at a sample size of about 50 (Schmidt, 1971). At this N
level, the average reliability of differences between regression
weights is only .3912, as can be seen in the last column of Table 1.
For both matrix samples, then, more than half the variance of the
regression weights is error variance at that N level where
regression weights overtake unit weights in predictive power.

6 ... the tendency for a larger number of suppressor variables to appear
than probably exist in the parent population matrices.

If we arbitrarily assume that .0150 correlation units is the min- 
imum increase in predictive power, for most practical purposes, 
that will render the computation of regression weights worthwhile, 
then for the entire X,, sample, averaging across levels of р, the 
minimum sample size needed is 60 (Schmidt, 1971). At this sam- 
ple size, the average reliability of differences between regression 
weights is approximately .420. For matrices without suppressors, 
the corresponding sample size is 184, and the average reliability at 
this N level is approximately .550. Both of these reliability figures 
are low relative to commonly accepted standards for psychome- 
tric devices. Even when regression weights do show more predictive 
efficiency than simple unit weights, differences in magnitude be- 
tween the individual weights are still very unreliable. Once again, 
the implications for the practice of directly interpreting relative 
magnitudes of computed regression weights are obvious. 
ROBUST TESTS FOR HOMOGENEITY OF VARIANCE

E sometimes has an a priori interest in testing the homogeneity
of variance of k independent groups. For example, B. F. Skinner
(1958) predicted that achievement scores of students finishing
programmed lessons would have a smaller variance than those of students
taught by other methods. If Gagné's (1965) hierarchically
arranged behaviors are involved, small variance in low level skills
would simplify teaching for higher skills.

A second reason for interest in $H_0\colon \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2$ is as an
assumption needed to guarantee the accuracy of various tests on
means. Although the analysis of variance test of equal means is
robust to violations of this assumption when equal n's > 10 are
used, if the n's are unequal, this assumption may be critical (Box,
1954).

Even with equal n's, the Tukey Wholly Significant Difference
(Tukey, 1953; Miller, 1966), the Newman-Keuls (Keuls, 1952;
Newman, 1939), the Duncan Multiple Range (Duncan, 1955), the
Least Significant Difference (Fisher, 1949), and the Scheffé (1953)
multiple comparison tests all require homogeneous variances. An
a priori specified test on a contrast $(\sum c_j = 0.0)$ via

$$t = \frac{\sum c_j \bar{X}_j}{\sqrt{MS_W \sum (c_j^2 / n_j)}}$$

also requires this assumption (Kirk, 1968). The robustness of these
tests to violations of this assumption is unknown.
2 Now at Miami University, Oxford, Ohio. 

2 Computer time for the studies described was contributed by the Ohio
University computing center.
Most texts have failed to emphasize the differences between the
normality assumption in tests of means and variances. The great
difference in the normality assumption in inferences based on these
two statistics is shown by the two standard errors. $\sigma_{\bar{X}} = \sigma_X/\sqrt{n}$
regardless of population form, but the standard error of $s^2$ is
$\sigma_{s^2} = \sigma^2\sqrt{2/(n - 1) + \gamma_2/n}$, where $\gamma_2$ is the kurtosis index of the population
(Johnson and Jackson, 1959). $\gamma_2$ is defined as $(\mu_4/\sigma^4) - 3 =
E[(X - \mu_X)^4]/\sigma^4 - 3$ and may vary from $-2$ to $+\infty$. Normal
distributions have a $\gamma_2$ value of 0. The normality assumption is
needed to fix the magnitude of the sampling fluctuation of $s^2$. If
the population is leptokurtic ($0.0 < \gamma_2 < +\infty$), the true value
of $\sigma_{s^2}$ will be larger than that used in the theoretical derivation,
raising the risk of a Type I error, P(EI), above alpha. Conversely,
with a platykurtic population ($-2 < \gamma_2 < 0.0$), the true $\sigma_{s^2}$ will be
smaller than the estimate from normal theory, and a conservative
test will result. Scheffé concludes that the violation of the normality
assumption produces only slight effects upon tests of means, but
dangerous effects on inferences on variances because the theoretical
distribution may have ". . . the correct location and at least for large n
the correct shape, but the wrong spread if the $\gamma_2$ of the effects differs
from zero." (Scheffé, 1959, p. 337.)

Pearson (1931), Geary (1947), and Gayen (1950) pointed out the
sensitivity of the F test of two independent sample variances to the
normality assumption. Box (1953) showed that as K increases above
2, this sensitivity increases. Bartlett (1937) defined the statistic

$$M = (N - K) \ln MS_W - \sum_k (n_k - 1) \ln s_k^2$$

where

$$MS_W = \sum_k (n_k - 1) s_k^2 \Big/ \sum_k (n_k - 1).$$

Bartlett showed that M/(1 + A) is approximately distributed as
Chi Square, $\chi^2$, with K - 1 degrees of freedom if X is normally
distributed, and $n_k > 3$ for all k:

$$A = \frac{1}{3(K - 1)} \left( \sum_k \frac{1}{n_k - 1} - \frac{1}{N - K} \right), \qquad N = \sum_k n_k.$$
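
As a rough illustration of the definitions above, the following Python sketch computes M, the correction term A, and the corrected statistic M/(1 + A). The function name `bartlett_statistic` and its return layout are illustrative assumptions, not part of the original report; the corrected value would still have to be referred to a Chi Square table with K - 1 degrees of freedom.

```python
import math


def bartlett_statistic(groups):
    """Return (M, A, M / (1 + A), df) for a list of samples.

    Each sample needs at least two observations and a nonzero variance,
    otherwise the logarithm below is undefined.
    """
    K = len(groups)
    dfs = [len(g) - 1 for g in groups]          # n_k - 1 for each group
    variances = []
    for g in groups:
        mean = sum(g) / len(g)
        variances.append(sum((x - mean) ** 2 for x in g) / (len(g) - 1))

    df_within = sum(dfs)                        # N - K
    ms_within = sum(d * s2 for d, s2 in zip(dfs, variances)) / df_within

    M = df_within * math.log(ms_within) - sum(
        d * math.log(s2) for d, s2 in zip(dfs, variances)
    )
    A = (sum(1.0 / d for d in dfs) - 1.0 / df_within) / (3.0 * (K - 1))
    return M, A, M / (1.0 + A), K - 1
```

With identical sample variances M is zero, and it grows as the sample variances spread apart.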

Box (1953) showed that M is asymptotically distributed as
$(1 + \tfrac{1}{2}\gamma_2)\,\chi^2_{K-1}$. While the A term rapidly goes to zero as the minimum
n increases, $\gamma_2$ does not. Since the mean of $\chi^2_{K-1}$ is K - 1, and the
variance is 2(K - 1), when $\gamma_2$ is not equal to zero, both the mean
and variance of M are affected. Non-normality can produce either
an extremely conservative test, [P(EI) < $\alpha$], or an extremely
permissive test, [P(EI) > $\alpha$]. Since $\gamma_2$ is in a multiplicative factor, its
effects increase as K increases. For example, for large n and a
slightly leptokurtic population, $\gamma_2$ = 1.0, Box (1953, p. 320) reports
P(EI)'s when using $\alpha$ = .05 of .11, .136, .176, and .257 for K's of 2,
3, 5, and 10 respectively. For greater leptokurtosis, these values will
rise to even higher levels. For platykurtic populations, they will go
in a conservative direction.

Box also shows that both the $F_{max}$ (Hartley, 1950) and the
Cochran (1951) tests are affected by kurtosis in much the same
manner as the Bartlett test, and concludes that the multivariate
tests for homogeneity of variance-covariance matrices "would be
expected to be equally dependent on the assumption of multivariate
normality" (Box, 1953, p. 331).

Unfortunately, this list of nonrobust tests about exhausts the 
tests of homogeneity of variance covered in behavioral applied sta- 
tistics texts (Edwards, 1960; Kirk, 1968; Myers, 1966; McNemar, 
1962; Winer, 1962). An exception is the Levene (1960) test
reported in Glass and Stanley (1970, p. 374). Glass and Stanley
recommend conducting an analysis of variance (AOV) on absolute
deviations from the group means (hereafter referred to as the L-A
test). Levene (1960) also proposed doing an AOV on the squared
deviations (hereafter L-$z^2$), and concluded, from a small sampling
study, that the choice between these two was a matter of taste.
Miller (1968) proved that the L-A test is not asymptotically 
distribution-free. Unless the population median is equal to the 
population mean (as in symmetric populations), P(EI) may be 
affected by the form of the population. 
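
The two Levene variants can be sketched in a few lines of Python: transform each score into its absolute (L-A) or squared (L-$z^2$) deviation from its own group mean, then run an ordinary one-way AOV on the transformed scores. The helper names `anova_f` and `levene_scores` are hypothetical, and the resulting F would be referred to the usual F table with K - 1 and N - K degrees of freedom.

```python
def anova_f(groups):
    """One-way analysis of variance F ratio for a list of samples."""
    K = len(groups)
    N = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / N
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (K - 1)) / (ss_within / (N - K))


def levene_scores(groups, squared=False):
    """Deviations from each group mean: absolute (L-A) or squared (L-z^2)."""
    transformed = []
    for g in groups:
        mean = sum(g) / len(g)
        if squared:
            transformed.append([(x - mean) ** 2 for x in g])
        else:
            transformed.append([abs(x - mean) for x in g])
    return transformed


data = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10]]
f_la = anova_f(levene_scores(data))                 # L-A test
f_lz2 = anova_f(levene_scores(data, squared=True))  # L-z^2 test
```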

There are at least three other tests not covered in any behavioral
science text. Bartlett and Kendall (1946) suggested breaking up
each of the K given samples into sub-samples, computing $s^2$ on
each sub-sample, and doing an AOV on the variable $\ln s^2$ to test if
the population variances are equal in the K treatment populations.
Since the $s^2$ distribution has a standard error proportional to its mean,
the ln transformation has a definite variance stabilizing effect.
$\ln s^2$ as X in the AOV will meet the additive model of the AOV
far better than using $s^2$ as X. (It is irrelevant if ln or $\log_{10}$ is used
since one is a linear transform of the other and the F test is
unaffected.) This test is described in Scheffé (1959, pp. 83-87) and
Odeh and Olds (1959). The test does have the assumption that the
$\gamma_2$ values are identical in the K treatment populations. However,
if $n_k = n$, and the sub-sample sizes, $v_{ik}$, are also identical, the F
test should be reasonably robust to this requirement. If the $v_{ik}$
differ, the use of a weighted least squares analysis may be more
appropriate. We shall restrict our comments to the case of equal
$n_k$ and equal $v_{ik}$, where mv = n and m = the number of sub-samples in
each treatment sample. An unanswered question is the optimal
breakup of the sub-samples. If n = 15, would we have greater
power using v = 3, m = 5, or the converse?

Another possible test is the Box-Andersen test (1955). For the
equal n case where K > 2, Box and Andersen define $M' = M/(1 + .5G_2)$,
where M is Bartlett's M and $G_2 = K \sum k_{4k} / (\sum s_k^2)^2$.
Computing this index is not a computation for a hand calculator since the
$k_{4k}$ are sample estimates of the fourth moment from the mean. For
each sample, E must compute

$$k_{4k} = \frac{n(n + 1)\sum x^4 - 3(n - 1)(\sum x^2)^2}{(n - 1)(n - 2)(n - 3)},$$

where x denotes a deviation from the sample mean. $G_2$ then combines these across
all samples, and adjusts the M value to compensate for the population
kurtosis. Again a common value of $\gamma_2$ is assumed in all treatment
populations. $M'$ could be contrasted with the Chi Square
distribution with K - 1 df, but more accuracy is obtained by using the
critical values of M from available tables (Pearson, 1958). Box and
Andersen (1955) ran a small sampling study on the K = 10 case
for normal, rectangular, and double exponential populations. The
$M'$ test was a considerable improvement over M on the two
non-normal populations; however, on the rectangular population, 37 out
of 200 values or 18.5% exceeded the theoretical 90th percentile,
suggesting the behavior of the test under platykurtic populations may
not be ideal. They also contrasted the power of $M'$ vs. M on the
normal population. For the one specific point they ran, the powers
were .815 and .810 for M and $M'$ respectively, suggesting little power
loss at this point.
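
The Box-Andersen adjustment is tedious by hand but short in code. The sketch below, for the equal-n case, assumes the reading of the formulas in which $k_4$ is computed from deviations about each sample mean and $G_2 = K \sum k_4 / (\sum s^2)^2$; the function names are illustrative only. Note that each sample needs n > 3, and that $M'$ is undefined if $1 + .5G_2$ is not positive.

```python
import math


def fourth_k_statistic(sample):
    """Sample k4 from deviations about the sample mean (requires n > 3)."""
    n = len(sample)
    mean = sum(sample) / n
    sum_x2 = sum((x - mean) ** 2 for x in sample)
    sum_x4 = sum((x - mean) ** 4 for x in sample)
    numerator = n * (n + 1) * sum_x4 - 3 * (n - 1) * sum_x2 ** 2
    return numerator / ((n - 1) * (n - 2) * (n - 3))


def box_andersen(groups):
    """Return (G2, M, M') for K samples of equal size n."""
    K = len(groups)
    n = len(groups[0])
    variances = []
    for g in groups:
        mean = sum(g) / n
        variances.append(sum((x - mean) ** 2 for x in g) / (n - 1))

    # Bartlett's M for equal n: MS_W is the plain mean of the variances.
    ms_within = sum(variances) / K
    M = K * (n - 1) * math.log(ms_within) - (n - 1) * sum(
        math.log(s2) for s2 in variances
    )

    G2 = K * sum(fourth_k_statistic(g) for g in groups) / sum(variances) ** 2
    return G2, M, M / (1.0 + 0.5 * G2)
```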

The Foster and Burr Q test (Foster, 1964) uses a monotone
function of the coefficient of variation of the sample variances as set
forth by Box (1954). For equal sample sizes, $Q = \sum s_k^4 / (\sum s_k^2)^2$.
Foster provides a table for small df, and a Chi Square
approximation is available for large df.
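
A minimal sketch of the Q statistic for equal sample sizes, assuming the form $Q = \sum s^4 / (\sum s^2)^2$; the function name is hypothetical and critical values would still come from Foster's table. Q equals 1/K when all sample variances are identical and approaches 1 as a single variance dominates.

```python
def foster_burr_q(groups):
    """Q = sum(s^4) / (sum(s^2))^2 for K samples of equal size."""
    variances = []
    for g in groups:
        mean = sum(g) / len(g)
        variances.append(sum((x - mean) ** 2 for x in g) / (len(g) - 1))
    return sum(s2 ** 2 for s2 in variances) / sum(variances) ** 2
```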

Two sampling studies were run to determine the robustness of
the several tests of homogeneous variances to deviations from
normality, and to compare the power of these tests. Both studies
were run on an IBM-360-44, using discrete populations of 2560
observations from Games and Lucas (1966). Descriptive indices
of the six populations are given in Table 1. In all cases, the same
population was used for a simulated experiment with three
independent groups; i.e., the form was homogeneous for any analysis.
Note that the extreme skew population also is the most leptokurtic.

A given population was read into computer storage. A
pseudo-random number generator with a cycle of $2^{29} - 1$ generated
rectangularly distributed numbers ranging from 0.0 to 1.0. Each of
these was treated as a cumulative probability, and the
corresponding X value of the specified population was stored. This was
repeated until a set of three independent samples of n observations
each was generated. All statistics under consideration were
computed on this random experiment, and each was tested against the
critical value for 5%, 2.5%, 1% and .1% levels of significance
(when possible). For some tests requiring special tables only the
5% and 1% levels were possible. The proportion of times $H_0$ was
rejected (power) was determined.
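
The sampling scheme just described amounts to inverse-CDF sampling from a stored discrete population. The following Python sketch mimics it with the standard library generator rather than the original IBM routine; `draw_sample`, `rejection_rate`, and the `statistic` and `critical_value` arguments are illustrative assumptions.

```python
import random


def draw_sample(ordered_population, n, rng):
    """Draw n values by treating uniform numbers as cumulative probabilities.

    ordered_population must be sorted; each stored value is equally likely.
    """
    size = len(ordered_population)
    sample = []
    for _ in range(n):
        u = rng.random()                       # uniform on [0.0, 1.0)
        sample.append(ordered_population[int(u * size)])
    return sample


def rejection_rate(population, n, K, statistic, critical_value,
                   experiments, seed=0):
    """Proportion of simulated experiments in which H0 is rejected."""
    rng = random.Random(seed)
    ordered = sorted(population)
    rejections = 0
    for _ in range(experiments):
        groups = [draw_sample(ordered, n, rng) for _ in range(K)]
        if statistic(groups) > critical_value:
            rejections += 1
    return rejections / experiments
```

Under a true $H_0$ the returned proportion estimates P(EI); after rescaling some groups it estimates power.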
For the non-null conditions, the data of the second and third
samples were multiplied by a constant to produce larger variances
than in the first sample. The system used was

$$Y_{ik} = X_{ik}\sqrt{1 + d(k - 1)}$$

where k is 1, 2, or 3. The resulting population variances were evenly
spaced and were d units apart. The differences between variances
were expressed via a noncentrality parameter, $\phi$, defined by

$$\phi^2 = \left[\sum_k (\sigma_k^2 - \bar{\sigma}^2)^2 / K\right] \Big/ \sigma_1^4$$

where $\bar{\sigma}^2$ is the mean of the three variances,
and $\sigma_1^2$ is the first variance.⁴ The coefficient of variation, CV, of the
variances is a transformation of this index: CV = $\phi$/(1 + d). The
index appears adequate for both within and between population
comparisons of power.

TABLE 1
Descriptive Indices of the Populations Used³

                                  Population Parameters
Population                     μ         σ      Gamma 1   Gamma 2
1. Normal                    31.000    9.924     0.00     -0.10
2. Slight Skew               39.503   12.383     0.45      0.14
3. Moderate Skew             23.483    5.314     0.64      0.53
4. Extreme Skew              19.980    6.370     2.04      6.54
5. Symmetric Leptokurtic     50.000    7.271     0.00      6.16
6. Rectangular                0.500    0.289     0.00     -1.20

3 Complete descriptions of these populations are available from the
American Documentation Institute. Order Document #8790 remitting $1.25 for
35mm microfilm or 6 by 8 inches photocopies.
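
A short numerical check of the scaling system may help. The sketch below assumes that the index $\phi$ is the standard deviation of the K population variances divided by the first variance, which is the reading under which the stated relation CV = $\phi$/(1 + d) holds exactly for three evenly spaced variances; the function names are illustrative.

```python
import math


def scaled_variances(first_variance, d, K=3):
    """Population variances after multiplying group k by sqrt(1 + d(k - 1))."""
    return [first_variance * (1 + d * (k - 1)) for k in range(1, K + 1)]


def phi_and_cv(variances):
    """Return (phi, CV) for a list of population variances.

    Assumption: phi = SD of the variances / first variance,
    CV = SD of the variances / mean variance.
    """
    K = len(variances)
    mean_variance = sum(variances) / K
    sd = math.sqrt(sum((v - mean_variance) ** 2 for v in variances) / K)
    return sd / variances[0], sd / mean_variance


variances = scaled_variances(4.0, d=3.0)   # [4.0, 16.0, 28.0]
phi, cv = phi_and_cv(variances)
```

For evenly spaced variances this gives $\phi = d\sqrt{2/3}$, so the example above has $\phi$ of about 2.45.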

The first study contrasted the power curves of the $F_{max}$ test, the
Cochran test, the two Levene tests (L-$z^2$ and L-A) and the
Bartlett test on the several populations. In all cases, K = 3, and n = 6
in this study. For any one point on the power curve all tests were
run on the same data. Different starting points on the random
number generator, and hence different samples, were used at the several
power points. Each point on the power curve consists of results from
1,000 simulated experiments, except for the normal population where
2,000 were used.

The second study contrasted the Bartlett test with the Box and
Andersen test $M'$, the Bartlett and Kendall test, and the Foster
and Burr Q test. Three independent samples each consisting of 18
cases were used in each simulated experiment. In one form of the
Bartlett and Kendall test, the 18 observations were sub-divided into
nine subsamples of two cases each, the $s^2$'s were computed on the
two cases, and the $\ln s^2$'s were used as the measures in an AOV.
This will be referred to as LEV2. In the other form, six subsamples
of three cases each were formed. This is referred to as LEV3. One
complication of the Bartlett and Kendall test on discrete data is
that a value of $s^2$ = 0.0 is possible, but log 0.0 is undefined. If an
$s^2$ was less than .002, this value replaced it.
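
The Bartlett and Kendall procedure, including the floor of .002 for very small subsample variances, can be sketched as follows. The function name `bartlett_kendall` and the dictionary it returns are hypothetical; subsamples are formed from consecutive cases, which presumes the cases are already in random order. Calling it with v = 2 or v = 3 on samples of 18 corresponds to the LEV2 and LEV3 forms.

```python
import math


def bartlett_kendall(groups, v, floor=0.002):
    """One-way AOV on ln(s^2) of consecutive subsamples of v cases each."""
    log_vars = []
    for g in groups:
        if len(g) % v != 0:
            raise ValueError("each sample size must be a multiple of v")
        ys = []
        for start in range(0, len(g), v):
            sub = g[start:start + v]
            mean = sum(sub) / v
            s2 = sum((x - mean) ** 2 for x in sub) / (v - 1)
            ys.append(math.log(max(s2, floor)))   # floor avoids log(0)
        log_vars.append(ys)

    K = len(log_vars)
    m = len(log_vars[0])
    group_means = [sum(ys) / m for ys in log_vars]
    grand_mean = sum(group_means) / K
    ms_between = m * sum((gm - grand_mean) ** 2 for gm in group_means) / (K - 1)
    ms_within = sum(
        (y - gm) ** 2 for ys, gm in zip(log_vars, group_means) for y in ys
    ) / (K * (m - 1))
    return {
        "F": ms_between / ms_within,
        "df": (K - 1, K * (m - 1)),
        "ms_within": ms_within,
        "group_means": group_means,
        "log_vars": log_vars,
    }


result = bartlett_kendall([[1, 2, 3, 2, 4, 6], [0, 0, 0, 1, 2, 3]], v=3)
```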

Between the first and second study, local improvements in soft- 
ware and hardware permitted increasing the number of simulated 
experiments to 3,000 for each power point on the normal population; 
and 2,000 for each point on the other populations. Other aspects of 
the computer sampling procedure were the same. 


Results 


The estimated risks of a Type I error when using $\alpha$ = .05 are
shown in Table 2 for each of the six populations. In general the
Bartlett, $F_{max}$, and Cochran tests show the trends Box (1953) predicts.

4 Another approach that implies the above restriction in the above
multiplicative model is to define $\sigma^2 = (\prod_k \sigma_k^2)^{1/K}$. The common variance value,
$\sigma^2$, is the geometric mean of the group variances.
TABLE 2
Estimated Probability of Type I Errors of Various Tests Using α = .05

Study 1, K = 3, n = 6
Population               Bartlett   Fmax     Cochran   L-z²     L-A
Normal                    .039*     .042     .045      .065*    .069*
Slight Skew               .049      .044     .054      .055     .088*
Moderate Skew             .066*     .074*    .065*     .068*    .082*
Extreme Skew              .244*     .220*    .255*     .077*    .143*
Symmetric Leptokurtic     .195*     .177*    .197*     .048     .076*
Rectangular               .011*     .014*    .009*     .074*    .090*

Study 2, K = 3, n = 18
Population               Bartlett   Q             M'       LEV3     LEV2
Normal                    .043      .035          .082*    .043     .038*
Slight Skew               .049      .042          .083*    .043     .041*
Moderate Skew             .067*     [illegible]   .082*    .042     .039*
Extreme Skew              .405*     .371*         .076*    .041     .039*
Symmetric Leptokurtic     .360*     .334*         .047     .039*    .040*
Rectangular               .002*     .001*         .002*    .048     .042

* H0: P = .05 is rejected at the .05 level of significance.


Platykurtic populations produce conservative tests, while leptokurtosis
produces an inflated P(EI). Comparing Bartlett's test on
populations 4, 5, and 6 illustrates that increasing n (from six in study 1
to 18 in study 2) only exaggerates this tendency. The central limit
theorem does not operate to improve the traditional tests on variances.

The Foster-Burr Q test shows the same trend as the traditional
tests, while the several tests designed to overcome this sensitivity to
leptokurtosis do indeed provide some relief. The LEV3 test shows
the fewest significant deviations from the .05 level and it and the
LEV2 test are slightly conservative on all populations. The L-A,
L-$z^2$ and $M'$ tests all show inflated P(EI)'s for several populations,
the L-A test generally being the worst of the three, rising to a
P(EI) = .143 on the extremely skewed distribution. Miller (1968)
presents arguments indicating this condition will not necessarily
improve as n increases, since the L-A test is not asymptotically
distribution free. The inflated P(EI)'s are much more serious than
conservative P(EI)'s, since it is possible for a test to be conservative
and still have higher power than other alternative tests. For example,
Games and Lucas (1966) and Srivastava (1959) demonstrate
that the AOV on independent groups for leptokurtic populations is
conservative, but most points on the power curve are above
corresponding points when the normal assumption is true. Thus it can be
argued that fewer Type I and fewer Type II errors will be made if 
the populations are leptokurtic, so the test is actually improved by 
this particular violation of assumptions. However, if P(EI) is above
alpha, it would be necessary to use a reduced nominal level of sig- 
nificance to control P(EI). This will necessarily lower the power 
curve. 

If only the risk of a Type I error were considered, one could rec- 
ommend running the LEV3 or LEV2 tests in all situations. How- 
ever, a consideration of the power curves of the several tests will 
lead to a very different recommendation. 

The normal population power curves of the five tests in study 1
are shown in Figure 1. When the variances differ, the Bartlett and
$F_{max}$ tests show superior power over the other tests. The power
superiority is even greater at all smaller values of alpha. The L-A
test has superior power to the Cochran and L-$z^2$ tests; however,
pleasure in this finding must be tempered by the fact that this test
showed an inflated risk of Type I error when $H_0$ was true, and
hence started at a higher point. The power curves for the slightly
skewed and moderately skewed populations are similar to those
shown in Figure 1, except for a slight rise in initial P(EI)'s for the
moderately skewed population. This is in line with theoretical
predictions. Skewness alone is not expected to change the
characteristics of the sampling distributions of $s^2$, and these two populations
have kurtosis values only slightly above 0.0 (.14 and .53
respectively).

Figure 2 contains the power curves for the tests of study 1 on the
extreme skew population. All tests show an inflated P(EI). Both
the L-$z^2$ and L-A share the characteristic that makes the traditional
tests questionable, but at a reduced level. The L-$z^2$ has the lowest
P(EI), .077, but rises to a power of only .264 for $\phi$ = 10. The L-A
has better power, but an unacceptable P(EI) of .143. This generally
discouraging picture is repeated on the symmetric leptokurtic
population, with the exception that P(EI) on the L-A test is not so
severely inflated and the L-A power curve never rises above that for the
Cochran test.

Power curves were also obtained on the rectangular population.
Despite their conservative P(EI), the Bartlett and $F_{max}$ tests showed
the highest power for all values of $\phi$ above 4.5. The L-A test
showed the next best power, but again this was accompanied by an
inflated risk of Type I error (.090). Another major difference from
previous curves is that the power of the L-$z^2$ curve rose
substantially, while the Cochran test attained a power of only .185 at $\phi$ =
10, and was the least powerful of the six tests.

Figure 1. α = .05, n = 6 power curves on the normal population.

The Cochran test showed less power than the Bartlett or the
$F_{max}$ tests on all populations. This is partly due to the fact that the
three population variances were evenly spaced. The Cochran test
merely compares the largest sample variance to the sum of the
sample variances. Winer contends that "Since the Cochran test uses
more of the information in the sample data, it is generally
somewhat more sensitive than is the Hartley test" (1962, p. 94). The
consistently lower power of Cochran's test in this study suggests it
should be used only if the populations have a kurtosis near 0.0 and
E expects one deviant large variance, and K - 1 relatively
homogeneous variances. Sacks (1969) also recommends the Cochran test
over $F_{max}$ when K > 12.

Figure 2. α = .05, n = 6 power curves on the extreme skew population.
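
The contrast between the two statistics is easy to see in code. This minimal sketch (function names are illustrative) computes Hartley's $F_{max}$ as the ratio of the largest to the smallest sample variance, and Cochran's statistic as the largest sample variance divided by the sum of all sample variances; critical values would still come from the respective published tables.

```python
def sample_variances(groups):
    """Unbiased sample variance of each group."""
    variances = []
    for g in groups:
        mean = sum(g) / len(g)
        variances.append(sum((x - mean) ** 2 for x in g) / (len(g) - 1))
    return variances


def f_max(groups):
    """Hartley's statistic: largest variance over smallest variance."""
    variances = sample_variances(groups)
    return max(variances) / min(variances)


def cochran_c(groups):
    """Cochran's statistic: largest variance over the sum of variances."""
    variances = sample_variances(groups)
    return max(variances) / sum(variances)
```

With evenly spaced variances such as 1, 4, and 9, the largest variance is only 9/14 of the total, which is one way to see why Cochran's statistic responds weakly to this pattern.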

The Levene tests were overall a disappointment. They do not
show the hoped-for robustness to form violations, and generally
have low power. The L-$z^2$ power was even lower when $\alpha$ = .01, never
reaching a value of .10 while the Bartlett and $F_{max}$ consistently
reached .6 when $\phi$ = 10. A study published after the present study
was completed confirms this characteristic.

Miller (1968) shows the asymptotic relative efficiency of the L-$z^2$
test is the same, when K = 2, as that of the Box and Andersen and
jackknife tests (Tukey, 1969). However, on sample sizes of 10
and 25, he found the L-$z^2$ not so powerful as the other tests,
particularly when using the .01 level. Comparing his results with those in
the present study, the relative inferiority of the L-$z^2$ test decreases
as n's increase. For the n = 25 case, the power is somewhat lower
than that of other tests, for n = 10, it is considerably lower, and with
our n = 6, it is far lower. Probably this is related to the fact that
the values of $(X_{ik} - \bar{X}_k)^2$ within a sample are not independent, and
the degree of dependence between values is larger for small n's.
The same problem exists in the values of $A_{ik} = |X_{ik} - \bar{X}_k|$ which
are the inputs to an AOV in the L-A test.

The results in the second study were more encouraging. On the
normal population, $\alpha$ = .05 power curves showed little difference
between the Bartlett, $M'$, and Q tests, since all reached near the 1.0
ceiling at $\phi$ = 4.47. The $\alpha$ = .01 test showed greater differences
between the tests, and is shown in Figure 3.

The curves for the slightly and moderately skewed populations 
were very similar to Figure 3. The Bartlett, M’ and Q tests con- 
sistently showed the greatest power. The LEV3 test showed the 
best control of P(EI) on all three populations. Unfortunately, this 
was accompanied by a lower power as shown in Figure 3. The LEV2 
test showed a far greater loss in power. On populations with only a 
mild degree of leptokurtosis (regardless of skewness), the Bartlett 
test remains unsurpassed. 

Figure 4 describes the results obtained on the extreme skew
population. Results on the symmetric leptokurtic population were
similar. Note the Bartlett and the Foster-Burr test reach a 37% risk of
Type I errors. The $M'$ and LEV3 tests have reasonable power curves,
although $M'$ has a P(EI) = .076. Again the LEV2 was conservative,
and lowest in power. The $\alpha$ = .01 curves showed similar trends in
both populations, including the crossing of the $M'$ and LEV3 power
curves.

Figure 3. α = .01, n = 18 power curves on the normal population.

The most unexpected results came on the rectangular population.
The $M'$ statistic was conservative, in contrast to the results from
Box and Andersen's small study (1955). At higher values of $\phi$,
however, the power of the $M'$ test actually decreased, and the difference between
power at alphas of .05 and .01 was never greater than .001. This
must be a function of the computation of $G_2$ in the formula $M'$ =
M/(1 + .5$G_2$), since no such results are observed in the Bartlett
test, which is also based on the value of M. This result was so
shocking that we carefully inspected the computer program to determine
if limited computer accuracy could have produced this result. All of
the higher powers used in the computations of the $k_{4k}$'s and $\sum k_{4k}$
were done in double precision, with approximately 16 digit accuracy
in all steps. The denominator of $G_2$, $(\sum s_k^2)^2$, was unfortunately
done in single precision. Apparently, the combination of only seven
digit accuracy in this step, plus the fact that the rectangular input
was a seven digit approximation to a continuous measure (as
contrasted with the 2 digit input of the discrete populations) produced
this artifact.

Figure 4. α = .05, n = 18 power curves on the extreme skew population.

Figure 5. α = .05, n = 18 power curves on the rectangular population.

The second study carried over the Bartlett test as the "benchmark"
test for homogeneity of variance. The Foster and Burr Q
test showed very similar characteristics to the Bartlett, with an
inflated P(EI) that makes it relatively useless for leptokurtic
populations. It typically had a slightly lower power than the Bartlett test
at moderate power levels, especially when using $\alpha$ = .01. There
seems to be no reason to advocate it over the Bartlett test or the
$F_{max}$.

The LEV2 test has unsatisfactory power. The LEV3 test,
however, was significantly conservative only on the symmetric
leptokurtic population, and had reasonable power on all populations
when using $\alpha$ = .05. The LEV3 and the Box $M'$ were the two tests
that maintained reasonable control over P(EI) over all populations,
with the LEV3 doing the better job in this respect. The Box and
Andersen $M'$ had superior power to the LEV3 on the first three
populations with $\gamma_2$'s of -.10 to .56. However, these are the
populations where the Bartlett and $F_{max}$ tests work well. On the highly
leptokurtic populations, the LEV3 had a superior power curve, with
P(EI) ≤ $\alpha$ at the beginning, and with superior power at large values
of $\phi$. This test thus stands as the best alternative under conditions
where the Bartlett and $F_{max}$ tests are dubious because of suspected
leptokurtosis.

Recall that the Bartlett and Kendall (LEV) test is based on
m subsamples of v cases each. Each $s^2$ is computed from $X_{ik}$ values
by $\sum x^2/(v - 1)$ and an AOV is carried out on the Km values of $\ln s^2$.
The values used in the LEV3 form were m = 6, v = 3. Figure 6
contains the LEV3 power curves of four of the populations. The
power of the test is influenced by the kurtosis of the population,
although the Type I error rate is not (for large n). Scheffé (1959, p. 84)
shows that $\sigma^2_{\ln s^2} \approx 2/(v - 1) + \gamma_2/v$. Since the expected value
of $MS_W$ is $\sigma^2_{\ln s^2}$ in an analysis of variance on $\ln s^2$, an increase in $\gamma_2$
will lower power (other things constant). Thus unfortunately, on the
highly leptokurtic distributions where the LEV3 test is needed most,
it will have somewhat lower power than on other populations.

Discussion 
The results of the two studies suggest the following general
advice. If E has reasons to believe the populations sampled are
platykurtic or mesokurtic ($\gamma_2$ < 0.5), he should use the Bartlett
or $F_{max}$ tests. Although it is true that these tests will be conservative
for the platykurtic populations, they have greater power than any
alternative test for most points on the power curve. One of the
encouraging results of the present study is that the great decrease in
P(EI) on these tests is not accompanied by similar decreases in
power. The top curve in Figure 6 is Bartlett's test on the rectangular
population, with $\alpha$ = .01. Even with reduced alpha, Bartlett's test
has greater power than the LEV3 test on the same populations.
However, if E has reason to suspect leptokurtosis in the
populations, or has no a priori indication of form and wishes to specify
P(EI) = .05 for any shaped population, the LEV3 method is
recommended. The lower power of this method can always be
compensated for by an increase in sample size. If inferences about
variances are a primary goal of the investigation, n's above 20 would
usually be advisable. Table 3 illustrates the computation of the test
on a sample of cases. The LEV3 test is preferred over the other
robust competitors because it showed the best control of P(EI) for the
leptokurtic populations, and had greater power than any other test
with similar control.

Figure 6. n = 18 power curves of the recommended Bartlett and Kendall
test (LEV3) contrasted with the most conservative Bartlett test.

The biggest question in the application of the LEV test is: Given
K samples of n cases each, how many subsamples, m, should be
used? The degrees of freedom of the $MS_W$ will be K(m - 1). An
increase in m yields a lower critical value of F, and greater power.
However, $E(MS_W) = 2/(v - 1) + \gamma_2/v$, so a larger v decreases
$MS_W$ and raises power. Since v = n/m, these two determinants of
power work in opposite directions, and an optimal value of v (or m)
for a given n is needed. An exact solution is not available, but an
initial approximation was made using the Pearson and Hartley
(1951) charts of the power function for analysis of variance. Let $\psi$
be the noncentrality parameter of these charts. For an AOV of K
sets of m independent observations each, each observation being
a value of $\ln s^2$,

$$\psi = \sqrt{\frac{m\theta}{\dfrac{2}{v - 1} + \dfrac{\gamma_2}{v}}}$$

where

$$\theta = \sum_k \left(\ln \sigma_k^2 - \overline{\ln \sigma^2}\right)^2 / K, \qquad \overline{\ln \sigma^2} = \sum_k \ln \sigma_k^2 / K.$$

To compare the effect of different m and v combinations for fixed n,
$\sqrt{\theta}$ was taken as a fixed constant, and the power of various
combinations was approximated using the charts. All n's from 12 to 36 that
had any two of the numbers 3, 4, 5, and 6 as factors were explored
for K = 3 and K = 5. Only integer values of v and m were used.
Setting $\gamma_2$ = 6.0 to represent a highly leptokurtic population where
the LEV test would be desirable yielded results suggesting that v = 3
would result in maximal power up to n = 18, and very little loss in
power up to n = 36. At large n's, very little power difference is
likely whether v = 4, 5, or 6. Using a $\gamma_2$ value of 0.0 yielded similar
results, however, with a suggestion that above n = 18, using v = 4, 5,
or 6 might produce a slight improvement in power over v = 3.
Asymptotic theory (Miller, 1968) suggests larger values of v would
be desirable for very large n. The above analysis does suggest inferior
power when using v = 2, as was found in the LEV2 results from
the second study.
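
The opposing pulls of m and v can be tabulated directly. The sketch below lists, for a given n, every integer layout with at least two subsamples of at least two cases, together with the error degrees of freedom K(m - 1), the expected error variance $2/(v - 1) + \gamma_2/v$, and a chart noncentrality value. The noncentrality is computed under the assumption that it takes the usual form $\sqrt{m\theta/\text{error variance}}$; the function name and the default `theta = 1.0` are illustrative, and converting these figures into power still requires the charts or a noncentral F routine.

```python
import math


def subsample_layouts(n, K, gamma2, theta=1.0):
    """List the (v, m) layouts of n cases with their AOV ingredients."""
    rows = []
    for v in range(2, n // 2 + 1):
        if n % v != 0:
            continue                      # only integer v and m are used
        m = n // v
        error_var = 2.0 / (v - 1) + gamma2 / v
        rows.append({
            "v": v,
            "m": m,
            "df_within": K * (m - 1),
            "error_var": error_var,
            "noncentrality": math.sqrt(m * theta / error_var),
        })
    return rows


rows = subsample_layouts(n=18, K=3, gamma2=6.0)
```

For n = 18 and $\gamma_2$ = 6.0, moving from v = 2 to v = 3 lowers the expected error variance from 5.0 to 3.0 while the error degrees of freedom drop from 24 to 15.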

If n is such that it is not evenly divisible by three or four, etc., the
Bartlett and Kendall test can be carried out with uneven size $v_{ik}$'s.
E.g., if n = 22, E may use seven subsamples: six of three
observations each, and the seventh of four observations. The analysis may
be conducted as usual on the seven values of $\ln s^2$, ignoring the
differential size of the subsamples. A slight gain in power can be secured
in such situations by the use of special weights reflecting the
different variability of the seventh subsample (Scheffé, 1959, p. 86).
Whether the extra power is worth the extra computational labor is a
question on which the present report sheds no light.

A problem with usage of the LEV test on small samples is that
great decreases in power would occur as K(m - 1) becomes small,
or if v = 2 is used. Miller (1968) recommends the Box and
Andersen test when n < 15 (K = 2). When K = 3, the use of the LEV
test with v = 3 down to n's of 12 seems reasonable. Below this
point the cautious E may have to face the horrors of the
computation of the Box and Andersen $M'$.

The use of the Bartlett and Kendall test has side benefits that
prove as useful as the test itself. The overall rejection of
$H_0: \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_K^2$ alone is not adequate for
investigations that are interested in which particular variances differ.
If the Bartlett or Fmax test is appropriate and significant, a follow-up
by $F = s_i^2/s_j^2$ is needed to determine which particular variances
differ. This procedure has the same objections as the F - t sequence
on means, i.e., the familywise risk of a Type I error rises as the
multiple tests are performed in the second stage.

The data generated by the Bartlett and Kendall test may be used
in an F - t sequence that is the logical equivalent of the above tests,
or it may be modified by using a multiple comparisons replacement
instead of the t test. Letting $Y = \ln s^2$, when the Bartlett and Kendall
test is concluded we have $MS_W$ and K values of $\bar{Y}_k$. To come to
conclusions about pairs of $\sigma_k^2$'s, E merely computes the usual
confidence interval for $\bar{Y}_i - \bar{Y}_j$. The interval is
$(\bar{Y}_i - \bar{Y}_j) \pm t_{\alpha}\sqrt{2MS_W/m}$. If this interval does
not contain 0.0, E may reject $H_0: \sigma_i^2/\sigma_j^2 = 1$, or
$\sigma_i^2 = \sigma_j^2$. If it does contain 0.0, $H_0$ must be retained.
If E converts this confidence interval by use of anti-logs, he will have an
approximate confidence interval for the ratio $\sigma_i^2/\sigma_j^2$.
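The interval and its anti-log conversion can be sketched as follows. This is a minimal illustration, not a definitive routine: the function name and argument names are hypothetical, the critical t value is supplied by the user from a table, and the numbers in the usage lines are those printed in Table 3 (which works in common logarithms, hence `base=10`).

```python
import math


def log_variance_ratio_interval(mean_y_i, mean_y_j, ms_within, m, t_crit,
                                base=math.e):
    """Interval for the difference of mean log variances and, by anti-logs,
    an approximate interval for the ratio of the two variances.

    m is the number of log-variance values per group and t_crit the
    tabled two-sided critical t for the within-groups degrees of freedom.
    """
    diff = mean_y_i - mean_y_j
    half_width = t_crit * math.sqrt(2.0 * ms_within / m)
    lower, upper = diff - half_width, diff + half_width
    return (lower, upper), (base ** lower, base ** upper)


# Values as printed in Table 3: group means .65582 and 1.30000,
# MS_W = .02785, m = 4, t = 2.262 on 9 df.
log_ci, ratio_ci = log_variance_ratio_interval(
    0.65582, 1.30000, 0.02785, 4, 2.262, base=10
)
```

If the log interval excludes 0.0, the ratio interval excludes 1, which is the same decision expressed on the original scale.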

If E wishes to be more conservative, and compute a set of confidence
intervals that have a joint or familywise risk of Type I
error of $\alpha$, he may use the Tukey WSD method (Miller, 1966). The
confidence intervals are simply computed as
$(\bar{Y}_i - \bar{Y}_j) \pm q_{\alpha,K}\sqrt{MS_W/m}$, where $q_{\alpha,K}$ is
obtained from the studentized range tables (Myers, 1966, p. 398). The
whole catalogue of multiple comparisons procedures may be applied to
the Y's as desired, thus extending the kind of questions that may be
properly handled. For example, reducing the class variance may be viewed
as a legitimate, secondary, educational goal. If E has three treatments
and a control group that is the "usual procedure," he may test if any one
of the three treatments produces a variance significantly below that of
the control group by use of Dunnett's (1955) procedure. The use of
general contrasts would permit testing special hypotheses (Kirk, 1968).
The use of $Y = \ln s^2$ in an AOV corresponds to a multiplicative
model of variances that might have some theoretical usefulness.
In an AOV, we use the additive model $\mu_k = \mu + \tau_k$ with the
restriction that $\sum \tau_k = 0.0$. If we impose this model on
$\ln \sigma^2$, we have $\ln \sigma_k^2 = \ln \sigma^2 + \ln \tau_k$. The
restriction $\sum \ln \tau_k = 0.0$ is equivalent to $\prod \tau_k = 1$,
i.e., the product of the $\tau_k$'s is one. The model for any cell variance
is $\sigma_k^2 = \tau_k \sigma^2$. Since variances constitute a ratio scale
(a meaningful zero point exists), a multiplicative model is reasonable, and
one possible way to handle differences in magnitude. Bechofer (1968)
proposes a multiplicative model for variances in multidimensional
AOV situations.
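A small numerical sketch may make the multiplicative model concrete. It assumes, for illustration only, that the overall $\sigma^2$ is taken as the geometric mean of the cell variances, which is what the restriction $\sum \ln \tau_k = 0$ implies; the function name and the example variances are hypothetical.

```python
import math


def multiplicative_effects(variances):
    """Split cell variances into sigma^2 (their geometric mean) and
    multipliers tau_k with product one, so that var_k = tau_k * sigma^2."""
    logs = [math.log(v) for v in variances]
    grand = sum(logs) / len(logs)            # additive 'grand mean' on the log scale
    sigma2 = math.exp(grand)
    taus = [math.exp(l - grand) for l in logs]  # log effects sum to zero
    return sigma2, taus


sigma2, taus = multiplicative_effects([2.0, 8.0])
```

With cell variances of 2 and 8 the overall value is 4 and the multipliers are 1/2 and 2, whose product is one.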
It has been the senior author's experience that variances are
often ignored in investigations, even when the theory involved has
clearer predictions about variances than about means. This is probably
because most students primarily encounter variances in terms
of a nasty assumption when doing a t test or an AOV, or they have
learned that tests on variances are untrustworthy due to sensitivity
to distribution form. Although the Bartlett and Kendall test has lower
power than the Bartlett or Fmax, it has demonstrated great robustness to
the normality assumption in this study, and in Miller's study (1968).

TABLE 3

An Example of the Computation of the Bartlett and Kendall Test
and Follow-up Multiple Comparisons

A. Original data (random order down columns) and s²'s.

      X1     s1²        X2     s2²        X3     s3²
      14                10                31
       8    9.000        9   19.000       10   120.333
      11                17                15

      10                12                15
      14    4.000       12   12.000       36   111.000
      12                18                24

      11                17                16
       9    4.000       22   30.333       23    82.333
      13                11                34

      14                27                27
      16    2.917       16   22.917        9    72.000
      15                20                25
      12                18                15

B. Y = log₁₀ s² entries for AOV and multiple comparisons.

        T1        T2        T3
      .95424   1.27875   2.08038     ΣΣY   = 15.7219
      .60206   1.07918   2.04532     ΣΣY²  = 24.328054
      .60206   1.48191   1.91558     SStot = 3.7299
      .46494   1.36016   1.85738     SSW   = 0.2506

Ȳ =   .65582   1.30000   1.97465     SSBT  = 3.4793

F = MSBT/MSW = (3.4793/2)/(.2506/9) = 1.7396/.02785 = 62.47.

Reject H0: σ1² = σ2² = σ3².

To test $H_0: \sigma_1^2 = \sigma_2^2$ or $H_0: \sigma_1^2/\sigma_2^2 = 1$ at the 5%
level, we construct the approximate 95% confidence interval:
$(\bar{Y}_1 - \bar{Y}_2) \pm t_{\alpha}\sqrt{2MS_W/m} = -.6442 \pm 2.262\sqrt{.013923} = -.6442 \pm .2669$.
Now $C(-.9111 < \log \sigma_1^2 - \log \sigma_2^2 < -.3773) = .95$. Since this does
not contain 0.0, $H_0$ is rejected. Converting back from log form,
$C(.1227 < \sigma_1^2/\sigma_2^2 < .4195) \approx .95$, so the variance in population 1
is less than half of that in population 2. The same process can be used to
contrast $\sigma_1^2$ with $\sigma_3^2$ and $\sigma_2^2$ with $\sigma_3^2$. If E desires to
construct the complete set of three intervals so that the risk that any one
of the three population variance ratios is not included in its interval is
.05, he simply uses the Tukey WSD form:
$(\bar{Y}_i - \bar{Y}_j) \pm q_{\alpha,K}\sqrt{MS_W/m} = (\bar{Y}_i - \bar{Y}_j) \pm 3.95\sqrt{.0069619} = (\bar{Y}_i - \bar{Y}_j) \pm .3296$.
This is a more conservative criterion, so we have a wider interval. Applying
this to $H_0: \sigma_3^2/\sigma_2^2 = 1$, we get
$C(.1676 < \log \sigma_3^2 - \log \sigma_2^2 < .8268) = .95$ or
$C(1.471 < \sigma_3^2/\sigma_2^2 < 6.711) \approx .95$. Again $H_0$ is rejected. The
larger variance was placed on top in this second interval to illustrate the
two kinds of outcomes. If group 3, say, were a control group indicating
"normal" variability, it would be wise to use $\sigma_3^2$ as the denominator in
two contrasts, and to use the Dunnett table values instead of t or q.
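The computations of Table 3 can be retraced with a short script. This is a sketch under one reading of the printed table: each group's 13 scores are taken as four consecutive subsamples of sizes 3, 3, 3 and 4, a grouping that reproduces the printed s² values. The names `GROUPS` and `log_variance_aov` are hypothetical.

```python
import math
import statistics

# Raw scores of Table 3, grouped into the subsamples assumed above.
GROUPS = {
    "T1": [[14, 8, 11], [10, 14, 12], [11, 9, 13], [14, 16, 15, 12]],
    "T2": [[10, 9, 17], [12, 12, 18], [17, 22, 11], [27, 16, 20, 18]],
    "T3": [[31, 10, 15], [15, 36, 24], [16, 23, 34], [27, 9, 25, 15]],
}


def log_variance_aov(groups):
    """One-way AOV on Y = log10(s^2), one Y value per subsample."""
    y = {name: [math.log10(statistics.variance(sub)) for sub in subs]
         for name, subs in groups.items()}
    all_y = [value for values in y.values() for value in values]
    grand_mean = sum(all_y) / len(all_y)
    means = {name: sum(values) / len(values) for name, values in y.items()}
    ss_total = sum((value - grand_mean) ** 2 for value in all_y)
    ss_between = sum(len(values) * (means[name] - grand_mean) ** 2
                     for name, values in y.items())
    ss_within = ss_total - ss_between
    df_between = len(y) - 1
    df_within = len(all_y) - len(y)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return {"means": means, "ss_total": ss_total, "ss_between": ss_between,
            "ss_within": ss_within, "f": f_ratio}


result = log_variance_aov(GROUPS)
```

The dictionary returned can then be fed to the pairwise or Tukey intervals described in the notes to the table.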
In addition it has the great virtue of permitting the student to transfer
all of the skills of AOV to new applications on variances. General
contrasts, multiple comparisons, etc., are all available for testing
hypotheses about variances. Students should be encouraged to use this
test, even if they sometimes pay a price in reduced power.


DIFFICULTY FACTORS, DISTRIBUTION EFFECTS,
AND THE LEAST SQUARES SIMPLEX
DATA MATRIX SOLUTION¹


JOS M. F. TEN BERGE 
University of Groningen, The Netherlands 


In his handbook of factor analysis Horst (1965) devotes a chap- 
ter to the subject of “Factor analysis and the binary data matrix.” 
He introduces the Least Square Simplex Data Matrix Solution 
(LSSDMS) to solve Ferguson's dilemma: that is, the rank of a 
matrix, that can be regarded as a perfect Guttman scale, equals 
the number of distinct item-preference values (p-values), whereas, 
as far as the content of the items is concerned a rank of one 
would be expected; or, component analysis of a perfect Guttman 
scale yields one component “difficulty factor" for each item-p-value 
(Ferguson, 1941). 

The rationale of the LSSDMS is as follows: if the dimension- 
ality due to the p-values is eliminated from the data matrix, the 
residual matrix can be analysed free from the so-called difficulty 
factors. This idea is worked out as follows: 

The rows and columns of an experimental (observed) binary data
matrix y are permuted so that similarity to a Guttman scale is maximized.
Next, one postulates a latent simplex (Guttman scale) x,
which represents the difficulty factors in a pure form. The latent
simplex x is transformed so that the variance in the residual matrix
$e = y - xB$ is minimized:

$\operatorname{trace}(e'e) = \min.$

The least-squares solution for the transformation matrix B equals

$B = (x'x)^{-1} x'y,$

which is the well-known matrix of regression vectors for estimating
a criterion matrix y from a predictor matrix x. In the purified
residual data matrix, all columns (variables) are independent
of the simplex variables:

$x'e = 0.$

¹ The author is obligated to Hofstee, Kluiter and Molenaar for critically
reviewing an earlier draft of this paper.

An Example 


The LSSDMS can be illustrated by an example of Horst's
(1965, p. 520 ff.). For an observed (experimental) matrix y (10
individuals, 4 variables) a simplex x (10 individuals, 3 variables)
is postulated. From the residual matrix, the correlation matrix is
computed, which can be analysed free from difficulty factors (Table 1).
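The regression step of the LSSDMS can be sketched without any special library. The following is an illustration only: the helper names are hypothetical, and the matrices `X` and `Y` are made-up binary data (a 10 × 3 simplex and a 10 × 2 observed matrix), not Horst's example. It computes B from the normal equations and checks that the residual matrix is orthogonal to the simplex.

```python
def transpose(a):
    return [list(row) for row in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a z = b (b may have several columns) by Gauss-Jordan
    elimination with partial pivoting."""
    n = len(a)
    aug = [[float(v) for v in a[i]] + [float(v) for v in b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("x'x is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def lssdms_residual(x, y):
    """Return B = (x'x)^-1 x'y and the residual matrix e = y - xB."""
    xt = transpose(x)
    b = solve(matmul(xt, x), matmul(xt, y))
    fitted = matmul(x, b)
    e = [[yv - fv for yv, fv in zip(yr, fr)] for yr, fr in zip(y, fitted)]
    return b, e


# Hypothetical data: a perfect simplex with column sums 9, 7, 5 ...
X = [[1, 1, 1]] * 5 + [[1, 1, 0]] * 2 + [[1, 0, 0]] * 2 + [[0, 0, 0]]
# ... and two made-up observed binary variables.
Y = [[1, 1], [1, 0], [1, 1], [0, 1], [1, 0],
     [1, 1], [0, 0], [1, 0], [0, 1], [0, 0]]

B, E = lssdms_residual(X, Y)
```

Every entry of x'e is zero up to rounding, which is the orthogonality property stated above.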
A Computational Alternative 


Horst (1965) has developed a somewhat complicated computational
procedure for the LSSDMS. However, since the LSSDMS
amounts to a simple linear regression analysis, one might save
considerable programming time and computation time if the simplex
variables were eliminated in the conventional way.
Thus, by the same token, the diagonal method of factor analysis
(e.g., as described by Nunnally, 1967, p. 155 ff.) could be used.
In that case, the procedure would be as follows:

The simplex and experimental data matrix are joined, and the 
correlation matrix is computed. From this, the simplex variables 
are eliminated one by one. 

To illustrate this, we apply the procedure to the above-named 


TABLE 1 
The LSSDMS 

Observed data Correlation matrix independent of 
ТҮ ТОТ Те 1 2 3 4 
1 3 14 Y x 1 1.000 
04.11 ит 2 .000 1.000 
101501 ER 3 .000 —.745 1.000 
О TRUE 4 100 —.439 .196 1.00 
ПОЗ 110 
1100 YU 
1100 100 
1010 100 
0000 000 
example from Table 1: In the south-east corner of Table 2 the
same residual correlations appear as in Table 1.

If one wishes to use the LSSDMS, computation by the
diagonal method is advocated on grounds of efficiency.


There is No Justification for Using the LSSDMS 


The LSSDMS is introduced by Horst as a method for eliminat- 
ing that part of the dimensionality that is accounted for by the 
item-preference-dispersion: The binary data matrix is to be pur- 
ified by means of the LSSDMS, before being factor analyzed. 

In the present article, it is argued that the LSSDMS does not 
deal adequately with difficulty factors. In the author’s opinion, 
the theoretical foundation of the LSSDMS is insufficient. This 
point can be made clear by a further analysis of two phenomena: 
on the one hand there is the effect of distribution shape discrepancy 
on correlation, and on the other, the occurrence of a difficulty fac- 
tor. 


The Effect of Distribution Shape Discrepancy on Correlation 


“It has long been known that only if two binary variables have 
equal preference values is it possible to have a correlation of one 
between the two” (Horst, 1965, p. 514). This statement is not 


TABLE 2 
The Diagonal Method of Factor Analysis 


Correlation matrix of simplex (1, 2, 3) and observed variables (4, 5, 6, 7) 


1 2 3 4 5 6 7 
1 1.000 
2 .509 1.000 
8 .333 .655 1.000 
4 1.000 .509 .833 1.000 
5 "309. 524.655. .509 1.000 
6 .408  .856 408  .408 —.089 1.000 
7 333 65 600.333 1218 (408 1.000 
Residual correlation matrix independent of simplex variables 
2 3 4 5 6 т 
p ЯЕ, т? i 1.000 
sing NE EN = —.745 1.000 
MEI Hr 2 = —.439 —.196 0 


false, but it certainly is misleading. The suggestion is that two
items correlate less than perfectly because their preference or
difficulty values are different. From here it is only a short step to
starting to talk in terms of difficulty factors: when items, given
their p-values, have as high a correlation as possible, factor analysis
yields as many factors (principal components) as there are distinct
difficulty values of the items (Ferguson, 1941). At this
point it is easily forgotten that means (as well as variances) are
irrelevant to product-moment correlation. It is an idiosyncrasy of
binary variables that leads to this sloppy thinking: the p-value
of an item determines the distribution shape. Items with unequal
p-values must have different distribution shapes (in terms of
skewness) and, to the latter discrepancy, product-moment correlation
is certainly sensitive. We have here a general property of
product-moment correlation that is not restricted to the phi
coefficient (Carroll, 1961).

Distribution shape difference tends to have only slight effects 
in studies of continuous variables; when dichotomous variables 
are used, however, the effect can be quite large (Nunnally, 1967, 
p. 130). 

In using dichotomous variables one can establish the presence 
of distribution effect from the phi-correlation matrix, where the 
items are arranged in increasing order of p-value. 


1. Since the more items differ with respect to their p-values, 
the lower the phi-correlation will be, the presence of distribution 
effect will be evidenced by the tendency of higher values to centre 
around the diagonal. Lower values of phi will be found in the 
south-west and north-east corners of the matrix. 

2. Since the effect appears stronger the further p-values depart
from .5 (Carroll, 1945, p. 16), other things being equal, the highest
values of phi will be found for the item with p = .5, and, with
increasing eccentricity (Cattell, 1952) of p-values, the values of
phi will decrease.


If row or column sums of the phi-matrix are graphically depicted
as a function of the item-p-values, a unimodal curve will
be produced, with a maximum in the region of the median p-value.
Item loadings on the first centroid factor (unities in the diagonal)
will, in the same way, form a unimodal function of the item-p-values.
Both distribution effects can be easily recognized in any equally
spaced matrix of phi-max coefficients.
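Both effects can be made visible with a small computation. The sketch below uses the standard ceiling of the phi coefficient for two binary items, $\sqrt{p_j q_i / (p_i q_j)}$ for $p_i \ge p_j$ with $q = 1 - p$; the function names and the five equally spaced p-values are illustrative choices, not taken from the text.

```python
import math


def phi_max(p_i, p_j):
    """Largest phi attainable for two binary items with the given
    proportions of ones (each strictly between 0 and 1)."""
    high, low = max(p_i, p_j), min(p_i, p_j)
    return math.sqrt(low * (1.0 - high) / (high * (1.0 - low)))


def phi_max_matrix(p_values):
    return [[phi_max(p, q) for q in p_values] for p in p_values]


# Equally spaced p-values, arranged in increasing order.
p_values = [0.1, 0.3, 0.5, 0.7, 0.9]
matrix = phi_max_matrix(p_values)
row_sums = [sum(row) for row in matrix]
```

In `matrix` the entries fall off away from the diagonal, and `row_sums` peaks at the item with p = .5, the median p-value.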

In a similar fashion one might try to establish distribution 
effects for continuous variables: variables, arranged according to 
skewness, might yield an R-matrix with higher values around the 
diagonal. As has been mentioned, however, the effect is quite 
small in the case of continuous variables (see also Carroll, 1961). 


Distribution Factors 


If a phi-correlation matrix is dominated by distribution effect,
the loadings on each principal component can be represented as
simple functions of the item-p-values. The loading on the first
principal component forms a bell-shaped function of the item-p-value,
with a maximum in the region of the median p-value. The
loadings on the second principal component form an approximately
linear function of p; those on the third and following components
form polynomials of increasing degree in p.

Principal component scores show a similar law of formation:
scores on the n-th component approximately form a function of
the n-th degree in the composite score of individuals. One may
compare the findings of Ferguson (1941) and Guttman (1950,
1954), respectively.

The factors produced by a distribution effect are commonly
referred to as difficulty factors; e.g., Ferguson (1941), Gourlay
(1951), Guilford (1963), Digman (1966), and Kaiser (1970). After
this point we will denote such factors as distribution factors:
They arise when variables, correlated as highly as their distribution
shapes permit, are factor analyzed. Distribution factors are
artifacts, since they do not tell us anything about the nature of
the underlying trait(s). They do tell us that factor analysis does
not offer any parsimony in describing the data at hand.

Guttman's "Principal Components of Scalable Attitudes" are
also distribution factors. Guttman (1950, 1954) tried to give
psychological interpretations to them, but how could mathematics
tell us anything about the nature of attitudes?


Difficulty Factors 


McDonald (1965) defines a difficulty factor, for his purposes,
as a factor on which the loading correlates with the mean of the
variable. This definition is not sound in general. At least three
instances of correlation between the loading and the mean of the
variables should be distinguished:


1. Factor loading and mean may correlate as a consequence of
distribution effect: the second distribution factor (see above)
satisfies McDonald's definition of a difficulty factor.

2. Factor loading and mean may correlate because the varia- 
bles with higher (or lower) mean scores exhibit more variance on 
some particular property. For instance, when a test for verbal pro- 
ficiency is administered to a group of verbally less-gifted people, 
the subtests with higher means will tend to have higher loadings 
on the general verbal factor. Finding a pseudo-difficulty factor in 
this manner is possible, especially in the study of continuous var- 
iables; the first instance above pertains mainly to binary variables. 

3. The attribute referred to in case two may concern “the capacity 
for solving difficult problems, regardless of content.” It is our 
suggestion that only in this instance does the term difficulty factor 
truly apply. McDonald (1965) substantiates a difficulty factor in- 
terpretation by stating that the variables he was studying were 
known to be tasks of increasing logical complexity. Only such 
a context-specific argument, based on the meaning of the variables,
can justify the use of the term difficulty factor. It may be noted 
that a difficulty factor, as it is defined here, can hardly be dis- 
tinguished from a factor like general intelligence. 


Correlation of loading and mean does add to the plausibility of 
a difficulty factor interpretation, but by itself it is not a sufficient 
condition, as can be seen from cases 1 and 2 above, nor is it a neces- 
sary condition: Variables of identical means may very well diverge 
in the degree to which they draw their variance from (and there- 
fore correlate with) some particular attribute; this is also possi- 
ble when this attribute happens to be "capacity for solving diffi- 
cult problems.” 

Contrary to the case of the distribution factors, one should not
expect to find more than one difficulty factor, since all difficulty
variance can be accounted for by one factor. Nor is the difficulty
factor an artifact. There is no need to eliminate the factor (e.g.,
by means of the LSSDMS), and it is not necessary to exclude it
from factor interpretation.


The LSSDMS Fails to Eliminate Distribution Factors

The LSSDMS is aimed at eliminating the factors which we
termed distribution factors. The idea was to operationalize these
factors in the simplex. However, it is by no means clear how
distribution factors can be contained in the simplex. The simplex
has no conceivable relationship with the distribution factors.
Whereas the LSSDMS treats each single variable as having components
that can be partialed out, it takes two (differently
distributed) variables for distribution effect to be defined.

It may be instructive to consider the following hypothetical
case: if the experimental data matrix were a perfect Guttman
scale itself, then the LSSDMS would bring all intercorrelations
closer to zero. In fact, all pairs of variables that have p-values on
both sides of any one of the simplex p-values will have vanishing
residual correlations. The proof is based on Guttman's simplex
theory (Guttman, 1954, p. 272 ff.). When each binary variable
in a perfect Guttman scale is assigned the parameter

$a_i = \sqrt{p_i/q_i}, \quad q_i = 1 - p_i,$

a function of its p-value only, then the correlation

$r_{ij}(\text{phi-max}) = a_j/a_i \quad \text{for } p_i > p_j.$

Using this simple expression, one can readily infer some inequalities
concerning first- and higher-order partial correlations in
eliminating the latent simplex from the perfect Guttman scale. It
is seen that the LSSDMS works in the wrong direction: distribution
effect kept correlations too low; the LSSDMS will render
them even lower in this hypothetical case.

In addition to the LSSDMS, Horst presents a Least Square
Simplex Covariance Matrix Solution (Horst, 1965, p. 522 f.).
This method requires subtraction of the regression on the simplex
covariance matrix from the experimental covariance matrix. Here,
too, recognition of the fact that Ferguson's dilemma stems from
distribution effects is lacking. This method will be omitted from
consideration.


Factoring the Binary Data Matrix


The sensitivity of product-moment correlation to differences in
distribution shape does not by itself prohibit factoring binary
data. This is because most binary items used by psychologists
correlate far below the ceiling values imposed by distribution shape
discrepancy. Consequently, in most cases one will easily accept
distribution shape discrepancy as an indication of non-equivalence
of the items.

The picture changes when intercorrelations are close to their 
maximum values. Each item will give rise to one factor. Using 
the product moment correlation consistently here can be done 
only at the expense of parsimonious (unidimensional) descrip- 
tion of the data. There are three kinds of solutions to this dilemma: 

First, one can abandon the model of factor analysis in favor of
the Guttman scale model, for which probabilistic versions are
now available (Mokken, 1970; Proctor, 1970). The difficulty is
that one must determine to what extent the data matrix resembles
a Guttman scale before one abandons factor analysis. Although
in theory the coefficient of homogeneity (Loevinger,
1947), the Measure of Sampling Adequacy (Kaiser, 1970), or a
method for fitting the perfect simplex (Schönemann, 1970) might
be used as stepping stones, the possibility of finding useful critical
values seems slight.

In the second place, instead of using phi, one could use the
ratio phi/phi-max or the tetrachoric correlation. Both indices
are perfect when the items, given their distribution shapes, correlate
maximally, so that factor analysis would yield one factor. For
a discussion of these coefficients one may consult Wherry and
Gaylord (1944), Carroll (1945, 1961), Smith (1951, 1955), Guttman
(1953) and Lord and Novick (1968).

It is conceivable that the ratio phi/phi-max and the tetrachoric
correlation, despite their lack of elegance, are useful aids in
making decisions. Empirical research may show that item selection
based on factor loadings computed from either of these indices
will yield the same scales as would be obtained from product-moment
correlations between 7-point variables.

Finally, one may retain the phi coefficient but choose a model
of factor analysis that does not yield distribution factors. Horst's
suggestion, to factor the residual matrix after application of the
LSSDMS, would fall into this category, as well as a recent proposal
by Kaiser (1970, p. 407). Kaiser asserts that image analysis
of the binary data matrix may avoid distribution factors. He
refers to "a crude appeal to the Central Limit Theorem."

This appeal becomes quite crude if the data matrix itself is a perfect
Guttman scale. In this case the images are linear composites
of at most two items. This follows again from Guttman's simplex
theory (Guttman, 1954, p. 291). Distribution effect will still show in
the image covariance matrix, and distribution factors remain.
However, it should be said that in order to rid ourselves of dis- 
tribution effect, we do not need (multivariate) normal distribu- 
tions. Distribution effect decreases as the distributions become 
more alike, especially with respect to skewness. Therefore, it may 
be speculated that the images, being less skewed than the original 
items, will be less sensitive to distribution effect. 


Summary 


Horst raises Ferguson's dilemma again. He introduces a method
for eliminating distribution factors. The method fails to deal
adequately with these factors. A distinction is made between
factors resulting from the effect of distribution shape discrepancy
on product-moment correlation, and a factor referring to an attribute
like "capacity for solving difficult problems." Some well-known
alternative solutions for the problem of distribution factors
are mentioned.


RASCH'S LOGISTIC MODEL VS.
THE GUTTMAN MODEL


NICHOLAS E. BRINK 
The Pennsylvania State University 


The purpose of this study is to compare the Rasch and the
Guttman models of measurement and thus add to the description 
of the characteristics of Rasch’s logistic model. Such knowledge 
is of importance in making decisions as to which model and 
which statistics should be used in evaluations of tests. 


Rasch Model 


Work has just begun in describing the characteristics of data
that produce good fit to the Rasch model. In her dissertation
Panchapakesan (1969) has explored the effect of varying item
discrimination, the presence of "bad" items, and the effect of
guessing on the model. She found that inequality of item discrimination,
lack of unidimensionality, and variation in guessing decreased
fit to the model. The statistic used in evaluating fit to
the Rasch model is the chi-square test for goodness of fit between
observed data and expected values of that data. For a clear presentation
of the Rasch model and this test of goodness of fit see
Wright and Panchapakesan (1969).


A Comparison


In asking the question, "What do data with perfect fit to the
Rasch model look like?" characteristics of other measurement
models were considered. Similarities were noted between the
Guttman and Rasch models. While the Rasch model scales items
on easiness and subjects on ability, the Guttman model orders the
items on difficulty and the subjects on total score.

These similarities do not continue, however. Some of the more
evident differences between the models are: (a) The Rasch model
alone determines estimates of subject ability and item easiness
that have ratio scale characteristics, i.e., the test parameter scales
have zero points that actually represent no ability and no easiness.
(b) For the Rasch model the ability estimates are independent
of the item easiness and the number of items. This is not the case
with Guttman scaling, where scores are the total number of correct
responses and thus dependent upon both the number and difficulty
of items. (c) The Rasch scaling procedure also produces ability
estimates independent of the sample of subjects used to calibrate
the item easinesses. This independence has been well illustrated by
Wright (1968). Again the Guttman model does not attempt to
produce this objectivity. (d) The Guttman model does not
attempt to make assumptions more restrictive than that the
data are ordinal on the two dimensions, i.e., item difficulty and
total score. Conversely, the Rasch model is a latent trait model,
a model that estimates the persons' underlying trait. (e) One of
the findings of Panchapakesan (1969) is that equal item discrimination
is necessary for good fit to the Rasch model. Figure
1a represents the item characteristic curves (ICC) for five items
over a range of ability. These curves represent the probability of
responding correctly to an item, given the subject's ability. For
equal item discrimination, these curves need to be parallel. This
is not necessary for Guttman scaling. Instead Guttman scaling
requires that these curves be discrete or nonoverlapping on the
range of ability measured by each item (Figure 1b).


FIGURE 1
Item Characteristic Curves
Figure 1a. Perfect Rasch Scale. Figure 1b. Perfect Guttman Scale.
(Horizontal axis in both panels: ability.)


Simulation I: Perfect Guttman Data


Since the data of both models are matrices of ones and zeros, 
correct and incorrect responses, and in that both models are con- 
cerned with two test parameters, item easiness and subject ability, 
the question still remains whether or not data with good repro- 
ducibility (fit to the Guttman model) will also show good fit to the 
Rasch model. If the same data should fit both models then the 
Rasch model would seem to be the superior model, since the 
Rasch model provides more information for each person and 
each item. 

In order to provide control over wanted and unwanted variables,
simulated data were used. $T_j$, person j's total score, was
generated as a uniformly distributed random variable with a
range from 0 to 64. $E_{ij}$, the error found in person j's response to
item i, was generated as a normally distributed random variable with
a mean of zero and a variable standard deviation. This standard
deviation was varied to change the degree of reproducibility of
the Guttman data. A standard deviation of zero produced data
representing a perfect Guttman scale. For person j, if the item
number, i, was less than $T_j + E_{ij}$ then the response to that item
was considered correct. Otherwise the item was incorrectly answered.
Data representing the responses of 1000 persons to 64
items were generated.
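The generating rule just described can be sketched as follows. This is an illustration, not the original program: the function name and its arguments are hypothetical, items are assumed to be numbered 1 to 64, and the total score is assumed to be drawn from a continuous uniform distribution (the text does not say whether it was continuous or integer-valued).

```python
import random


def simulate_guttman(n_persons=1000, n_items=64, error_sd=0.0, seed=None):
    """Binary response matrix: person j answers item i correctly when
    i < T_j + E_ij, with T_j uniform on [0, n_items] and E_ij normal
    with mean zero and standard deviation error_sd."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_persons):
        t_j = rng.uniform(0, n_items)
        row = []
        for i in range(1, n_items + 1):
            e_ij = rng.gauss(0.0, error_sd) if error_sd > 0 else 0.0
            row.append(1 if i < t_j + e_ij else 0)
        data.append(row)
    return data


perfect = simulate_guttman(n_persons=50, error_sd=0.0, seed=1)
noisy = simulate_guttman(n_persons=50, error_sd=8.0, seed=1)
```

With `error_sd=0.0` every row is a run of ones followed by a run of zeros, i.e., a perfect Guttman pattern; larger values introduce reversals around each person's cut point.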

Data were simulated with the error standard deviation equalling
zero. In submitting this perfect Guttman data to the Rasch
analysis a problem developed. In that items either answered
correctly or missed by all subjects are beyond the range of the
calibration sample, the easiness of these items cannot be determined.
When this occurs the items are deleted. Similarly, when a
subject misses all items or answers all items correctly the ability
of this subject exists somewhere beyond the range of ability
measured by this set of items and thus cannot be determined.
Again, these subjects are deleted from the response matrix, leaving
another subject or item to be deleted on the next round of truncation.
The result is no data to be analyzed by the Rasch model.
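The successive rounds of truncation can be illustrated with a small sketch. It is not the Rasch calibration program itself, only the deletion rule described above, and the function name is hypothetical: persons with zero or perfect scores and items passed or failed by everyone are removed until nothing more can be removed.

```python
def truncate(matrix):
    """Repeatedly delete persons (rows) with zero or perfect scores and
    items (columns) answered alike by all remaining persons."""
    rows = [list(r) for r in matrix]
    changed = True
    while changed and rows and rows[0]:
        changed = False
        n_items = len(rows[0])
        kept = [r for r in rows if 0 < sum(r) < n_items]
        if len(kept) != len(rows):
            rows, changed = kept, True
        if not rows:
            break
        n_persons = len(rows)
        keep_cols = [c for c in range(n_items)
                     if 0 < sum(r[c] for r in rows) < n_persons]
        if len(keep_cols) != n_items:
            rows = [[r[c] for c in keep_cols] for r in rows]
            changed = True
    # A matrix with no rows or no columns left counts as empty.
    return rows if rows and rows[0] else []


# A perfect Guttman scale on four items, one person per possible score.
perfect_scale = [[1] * k + [0] * (4 - k) for k in range(5)]
remaining = truncate(perfect_scale)
```

For the perfect scale each round exposes a new extreme person or item, and `remaining` ends up empty; a matrix with reversals, such as `[[1, 0], [0, 1]]`, survives unchanged.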


Simulation II: Near Perfect Guttman 
Data and Random Data 


To avoid this problem near perfect Guttman data were generated
with enough random deviation from a perfect Guttman scale
so that few or no items or subjects would be deleted. This was
done by increasing the standard deviation of the error variable
$E_{ij}$. The effect of this on the data matrix is to increase the
probability (from zero to one half) of a subject missing an item below
his ability level and, similarly, to increase the probability of
passing an item above his ability level. In that this error was normally
distributed, this probability diminished as the distance of
the easiness of the items from an item measuring the subject's
ability level increased. Data sets were generated with this standard
deviation increasing from 8 to 320. With large values of
this standard deviation the data matrices were essentially random.

To examine the effect of this extreme case 20 data sets of com- 
pletely random data were generated. 


Results 


With the error standard deviation being small, the data produced
a fit to the Rasch model with a probability of one. With
increases in the error standard deviation, fit to the Rasch model
rapidly decreased to around a probability of .5 (Table 1). For
the 20 sets of completely random data the average probability
of goodness of fit was .414, S.D. = .178, which was not significantly
different from a probability of fit of .500.


Discussion 

TABLE 1
Mean* Chi-square Probabilities for Varying Error Deviations
from the Perfect Guttman Scale

Error Deviation      Mean Chi-square Prob.
8                    .986
16                   .864
24                   .691
32                   .581
40                   .628
48                   .427
56                   .428
64                   .580
128                  .582
160                  .480
320                  .565
Random Data**        .414

* Eight data sets were generated at each level of error deviation; these
means differed at the .001 level of significance.
** This mean was based on 20 sets of random data.

The nature of the probabilistic model needs to be examined here.
Because the Rasch model is a probabilistic model, distributions of
probabilities of item and person parameters are assumed and produced.
If there is no variability in the observed scores from the expected
scores, then the expected probabilities of the model are not met.
Thus this probabilistic model is not appropriate for the data.
This is what occurs when the chi-square probability of fit reaches
one. This may be better seen with an example. If six coins were
repeatedly tossed you would not expect to get exactly three heads
and three tails on each toss. If this were the case the probability
distribution of these events would not be met.
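
The coin-tossing example can be made concrete with the binomial distribution. The short Python sketch below assumes six independent tosses of fair coins; the function name `prob_heads` is introduced only for this illustration.

```python
from math import comb


def prob_heads(k, n=6, p=0.5):
    """Binomial probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


if __name__ == "__main__":
    for k in range(7):
        print(k, prob_heads(k))
```

With fair coins the probability of exactly three heads is 20/64, so a record in which every toss of six coins gave three heads and three tails would itself contradict the probability model.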

In the case of a perfect Guttman scale no variability is allowed. 
This is seen in the definition of reproducibility. Reproducibility is 
the necessary condition for a perfect Guttman scale, where from a 
person’s total score his response pattern to the set of items can
be exactly determined. In this sense the Guttman model is a 
deterministic model contradicting the assumptions of the Rasch 
model. The sought after probability for the chi-square test of 
goodness of fit is one half. Great deviation either way from one 
half represents a lack of fit to the Rasch model. 

In the case of completely random data and data with high error
standard deviation, a probability of fit near .50 was completely
unexpected. What do such data mean? Wright’s Law School Admission
Test study (Wright, 1968) used data that had zero fit to the model,
which seems to be the more common occurrence with real data. In
examining the original assumptions of the Rasch model this
persistence of good fit for random data can be explained.

One of the assumptions on which Rasch bases his model is that
all responses, given the item and person parameters, are
stochastically independent (Rasch, 1966). This term, stochastically
independent, can be translated to “local independence.” Though
neither Rasch nor Wright uses this term in his writing, it was used
at the 1969 AERA presession on Rasch scaling. The term “local
independence” is also used to describe a basic assumption of
Birnbaum’s model (Panchapakesan, 1969; Lord and Novick,
1968). Birnbaum’s model is similar to Rasch’s model but with
the additional parameter of item discrimination. The assumption
of local independence provides that “at a fixed point X the
probability for joint occurrence are products of the separate
probabilities (Lazarsfeld, 1960, p. 85).” In other words, “those
examined at a given ability level who answer a given item correctly
are no more likely to answer other items correctly than are those
examinees at the same ability level who answer the given item
incorrectly (Lord, 1966, p. 25).”

This can explain why completely random data show good fit
to the model. It represents the case of “local” data, i.e., of
subjects with the same ability level and all items at the difficulty
level measuring that ability. In this case the item characteristic
curve for each item would coincide and thus not meet the criterion
of a perfect Guttman scale.

This illustrates another advantage of the Rasch model over the
Guttman model. The ideal and most precise estimates of a person’s
ability are made when the easiness of the items matches the
subject’s ability, i.e., with repeated measurement of the person’s
ability. With the Guttman model a person’s ability is only measured
by a single item or within an interval of only two items.
The precision of measurement is determined by how fine a
discrimination can be made between these two items. If the items
exactly meet his ability level then you would expect a 50-50 chance
of the person passing each item. In this case the perfectness of the
Guttman scale would be lacking.


Conclusion 


Due to the Rasch procedure of truncation, a perfect Guttman
scale is not of a form that can be examined using the Rasch
analysis. A perfect or near perfect Guttman scale does not meet
the assumption of the Rasch model that it is a probabilistic
model. Random data, though, represent the case where all persons
have the same ability and all items are measuring that ability;
thus such data produce good fit to the Rasch model. A perfect
Guttman scale does not allow for repeated measurement of a
particular level of ability and thus does not possess the precision
that may be possessed by a Rasch scale.
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METHOD OF OBTAINING THE INDEX OF 
DISCRIMINATION FOR ITEM SELECTION AND 
SELECTED TEST CHARACTERISTICS: A 
COMPARATIVE STUDY¹


LOYDE W. HALES 
Ohio University 


More than 60 different methods for computing indices of
discrimination have been proposed. The majority of the methods
suggested have been correlational techniques for expressing the
relationship between a score on an item and the total test score.
However, many approaches to item analysis have been used which
are based on the differences in the percentages or numbers of cases
in different groups and are not dependent on the assumptions
underlying correlational techniques. In this paper three methods
which have been suggested for use by classroom teachers will be
compared (Adams, 1964; Davis, 1964; and Ebel, 1965).

Flanagan (1939) devised a chart that shows product-moment
correlation coefficients for various proportions of success in the
upper and lower 27% of the criterion group. This index is similar
in advantages and limitations to the biserial coefficient with the
following exceptions: it is relatively easy to obtain, it is more
dependent on the assumption that the regression of item score on
test score is linear, and, for equal Ns, it is more reliable.

Flanagan (1939) stated that this index provides an estimate of
the product-moment correlation which is unbiased by difficulty;
i.e., that this index is independent of the difficulty of the item.
However, unless the proportions used in obtaining these
coefficients are corrected for chance, Davis (1946) has shown that
the indices of difficulty and the indices of discrimination have low
positive correlation. Also, the actual discrimination reflected by
any one value of r_bis is not the same for all levels of difficulty.
As has been shown by Findley (1956), when the discriminatory
power of the item is held constant, the value of r_bis increases as
one proceeds in either direction from the 50% level of item
difficulty. Finally, the difference in discrimination between
coefficients of .50 and .55 is not the same as the difference in
discrimination between coefficients of .80 and .85.

¹ The author is grateful to Dr. Dale Scannell for advice in developing and
conducting this study.

For each value of r in Flanagan’s table, Davis (1949) obtained
its corresponding Fisher z value. Then he multiplied each z-value
by the constant 60.241, producing an index of discrimination
with a range from 0 to 100. Thus, the Davis index, a linear
transformation of the Fisher z, is based on the assumptions of the
Flanagan coefficient, and has most of the advantages and
disadvantages of that coefficient with the following exceptions: it
has an interval scale of values and all indices based on samples of
the same size will have the same error of measurement. Davis
recommended that the proportions used be corrected for chance
before calculating the Davis index for items. Thus, if only the
magnitude of the item validity indices is considered, the items
selected from Flanagan’s estimates of r (using corrected-for-chance
proportions) should also be selected by the Davis approach.
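
The transformation just described can be illustrated numerically. The Python sketch below assumes only what the text states, namely that the Davis index is 60.241 times the Fisher z of r; the function name `davis_index` is hypothetical, and any rounding Davis applied in his published table is not reproduced.

```python
import math

DAVIS_CONSTANT = 60.241  # multiplier quoted in the text


def davis_index(r):
    """Davis discrimination index as a linear transformation of Fisher z."""
    fisher_z = math.atanh(r)  # Fisher z = 0.5 * ln((1 + r) / (1 - r))
    return DAVIS_CONSTANT * fisher_z


if __name__ == "__main__":
    for r in (0.50, 0.55, 0.80, 0.85):
        print(r, round(davis_index(r), 2))
```

Because the index is linear in z rather than in r, the step from .80 to .85 corresponds to a larger change in the index than the step from .50 to .55, which reflects the unequal differences in discrimination noted for the correlation coefficients.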
Because of the limitations of Flanagan's r, Findley (1956) sug- 
gested the use of the index of discrimination developed by John- 
son (1951). Net D is based on the difference between the propor- 
tions of success on an item in the 27% scoring highest on the test 
and the 27% scoring lowest on the test. Net D is an unbiased index 
of the effectiveness of the item in discriminating between the upper 
group and the lower group. 
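
As a brief illustration of the computation, net D for a single item can be obtained as follows. The Python sketch is hypothetical in its function name and in the counts used; it assumes upper and lower groups of equal size, each being 27% of the examinees.

```python
def net_d(upper_correct, lower_correct, group_size):
    """Net D: proportion correct in the upper group minus that in the lower.

    upper_correct and lower_correct are counts of correct answers in the
    highest- and lowest-scoring groups, each containing group_size examinees.
    """
    return (upper_correct - lower_correct) / group_size


if __name__ == "__main__":
    # Hypothetical item: 30 of 40 upper-group and 12 of 40 lower-group
    # examinees answer correctly.
    print(net_d(30, 12, 40))
```

An item answered correctly by everyone in both groups, or by no one in either group, yields a net D of zero, which is one way to see why the index is tied to item difficulty.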
Net D is considered by many to reflect both item difficulty and 
item discriminatory power. The opponents of net D often express 
a desire to have independent indices of difficulty and discrimina- 
tion for each item. Nevertheless, since the maximum number of 
correct discriminations an item can make is directly related to the 
difficulty of the item and since net D is directly proportional to the 
net proportion of correct discriminations made by the item, it should 
be anticipated that net D will tend to select items of 50%
difficulty, but that this tendency only reflects the relationship
between difficulty and the maximum number of correct
discriminations an item can make.

In a study by Engelhart (1965), various indices of discrimination
(tetrachoric coefficients, phi coefficients, biserial coefficients, point
biserial coefficients, Davis discrimination indices, and net D indices)
were obtained for the items of two alternate forms of a test, and the
intercorrelations among these indices for each form were calculated.
The intercorrelations were quite high, ranging from .841 to .992.
When critical values of the indices of discrimination were used to
evaluate items, all indices agreed on 77% of the items. Where net D
disagreed with the majority of the other indices, items tended to be
quite easy or quite difficult. He concluded that net D (with median
split or upper and lower one-third comparison) was quite effective
in identifying items which should be revised or eliminated.

It is the purpose of this study to determine the relative value of
three item validation methods which may be employed in the
selection of items for inclusion in a test: Flanagan’s r, Flanagan’s r
computed from proportions which have been corrected for chance success
and having corrected indices of difficulty falling within the range
0.15-0.75, inclusive, and net D. In order to maximize the differences
which could occur between tests developed by these methods, factors
which also should be considered when constructing a test (e.g.,
obtaining an appropriate distribution of indices of difficulty,
obtaining an appropriate mean test score, and conforming to the table
of specifications) were disregarded.


Procedure 


In the construction of “Test 1: Social Studies” of the Tests of 
Academic Progress (1964), an item pool of 601 items was formed 
from items on eleven 55-item tryout tests which had been admin- 
istered to a total of 1501 tenth-grade students and 1390 eleventh- 
grade students. The number of students taking a single test ranged 
from 116 to 152. Four items were excluded because of incomplete 
data. For each item at each grade level, the proportion of all students
who answered the item correctly, the proportion answering the item
correctly in the highest 27% group as determined by total test score,
and the proportion answering the item correctly in the lowest 27%
group comprised the data available.

The data needed to correct for omissions were not available;
hence, the indices calculated were not corrected for omissions.
However, since the experimental tryout tests were “power” tests, it
would appear that omissions were more likely to be the result of a
lack of knowledge rather than a result of a lack of time. Also, items
occurring in the last six item positions on the experimental tryout
tests were selected for this study no more often than chance
expectation, suggesting that item position did not appreciably
influence the discriminatory power of these items.

Based on item data for tenth-, eleventh-, and pooled tenth- and
eleventh-grade students, three net D (D) and three Flanagan’s r
(r) indices of discrimination were computed for each item in the
item pool. Using corrected-for-chance-success item data for tenth-,
eleventh-, and pooled tenth- and eleventh-grade students, a
Flanagan’s r_c (r_c) index of discrimination was obtained for each
grade level on each item when the appropriate corrected-for-chance-success
index of difficulty fell within the range 0.15-0.75, inclusive. Solely on
the basis of the magnitude of D, one test was constructed for each
grade level (tenth, eleventh, pooled tenth and eleventh). Solely on
the basis of the magnitude of r, one test was constructed for each
grade level. On the basis of the magnitude of the r_c values of those
items which have indices of difficulty, corrected for chance success,
falling within the range 0.15-0.75, inclusive, one test was constructed
for each of the aforementioned grade levels. With but one position
reversal, items common to all tests of a grade level appeared in the
same relative position on each test and items common to two tests
of a grade level appeared in the same relative position on both tests.
The first nine items and the last 11 items on each test of a grade
level were common items. Each test contained 50 items.

The tests were administered by teachers to all students present in
tenth-grade English classes and eleventh-grade social studies classes
on the examination day, a total of 776 students from a senior high
school. The examination papers for 35 students were eliminated, 13
because the students were officially classified at a grade level
inappropriate to the examination. The remaining papers were eliminated
from appropriate tests in order that all tests of a grade level would
have the same number of papers. Thus, the sample for each test
designed for tenth-grade students contained 83 students and each
sample for the remaining tests contained 82 students.

Six different test forms, with common directions, were
administered within each classroom during a given class period. Each of
the three tenth-grade tests was given to approximately two-ninths
of the students in each tenth-grade classroom, with each of the
combined tenth- and eleventh-grade tests given to approximately
one-ninth of these students. Each of the three eleventh-grade tests was
given to approximately two-ninths of the students in each
eleventh-grade classroom, with each of the combined tests given to
one-ninth of the students. Within a classroom, students were randomly
assigned to test form.


Results 


Any two tests of a grade level were considered to be separate mea- 
sures if the number of identical items selected for both tests was less 
than 90%. As can be seen in Table 1, the greatest overlap between 
any two tests of a grade level was 76%; the average overlap between 
the tests was 62%. 

For each pair of tests, within each grade level, the indices of
difficulty of the items selected for the first test of a pair but not
for the second test were compared with the indices of difficulty of the
items selected for the second test but not the first. The indices of
difficulty of items unique to the tests constructed by each item
selection technique, when contrasted with the items unique to tests
constructed by each of the two remaining item selection techniques, are
shown in Table 2. When comparing D and r_c tests, 80% of the indices of
difficulty of the unique items on the D tests fell in the range
0.50-0.69, whereas 89% of the indices of difficulty of items unique to
the r_c tests fell in the range 0.30-0.49. When comparing D and r tests,
82% of the indices of difficulty of items unique to the D tests fell in
the range 0.40-0.59, whereas 70% of the unique items on the r tests fell
in the range 0.70-0.89, but an additional 20% fell in the range
0.20-0.29. When comparing the r_c and r tests, 94% of the unique items on
the r_c tests had indices of difficulty in the range 0.30-0.49, whereas
79% of the unique items on the r tests had indices of difficulty in the
range 0.50-0.89, with an additional 13% in the range 0.20-0.29.


TABLE 1
The Extent to Which Each Test of Each Grade Level Overlapped with Each
of the Other Tests of Their Respective Grade Levels

                      Per Cent of Test Overlap
Grade Level           D & r_c     D & r     r_c & r
Tenth Grade           66          72        62
Eleventh Grade        68          64        54
Combined Grades       54          76        44


TABLE 2
Frequency Distribution of the Difficulty Indices of the Unique Items in Three Sets of Tests

10th Grade Tests

D-r
4
3
1
2
1
11
4
1

Note. The indices of difficulty were not corrected for chance success.


TABLE 3
The Means and Standard Deviations of the Tests Constructed
by Three Item Selection Techniques

                   Test Means                Standard Deviations
Grade Level        D       r_c     r         D       r_c     r
Grade 10           25.54   22.96   26.77     10.52   9.45    9.41
Grade 11           26.33   23.46   27.26     11.73   10.81   10.74
Grade 10/11        27.49   21.26   28.10     11.07   10.23   11.99


The means and standard deviations of the scores obtained in the
administration of these tests are shown in Table 3. A two-tailed
t-test of the differences between each pair of tests at each grade
level is shown in Table 4. At each grade level, the r_c test was
significantly more difficult than the r test, with the D test falling
between the other two in difficulty.

The odd-even (stepped-up for total test length) and the
Kuder-Richardson 20 coefficients of reliability for each test are shown
in Table 5. The Fisher z transformation for each coefficient was
obtained. The null hypothesis of no difference between the Fisher z
transformations for each pair of tests at each grade level (for
Kuder-Richardson 20 and odd-even reliability coefficients) was retained
at the 0.05 level of significance. No significant differences were found.
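
A comparison of this kind can be sketched as follows. The source does not give the formula used, so the Python fragment below assumes the conventional large-sample test for two independent correlation coefficients, in which each Fisher z has sampling variance 1/(n - 3); the function name is hypothetical. The example values are the Grade 10 odd-even coefficients for the D and r tests reported in Table 5, with 83 students per test.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Approximate normal deviate for the difference between two coefficients.

    Assumes independent samples and the usual variance 1 / (n - 3) for each
    Fisher z; this is an assumption, not a formula taken from the source.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / standard_error


if __name__ == "__main__":
    deviate = fisher_z_difference(0.923, 83, 0.882, 83)
    # Compare with the two-tailed .05 critical value of 1.96.
    print(round(deviate, 3), abs(deviate) < 1.96)
```

Under these assumptions the deviate for this pair falls short of 1.96, which agrees with the retention of the null hypothesis reported above.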


TABLE 4
Tests of the Significance of the Differences between the Means of Tests
Constructed by Three Item Selection Techniques

                               Difference        t-Test of the
Grade Level     Tests          between Means     Differences
Grade 10        D - r_c         2.58              1.6509
Grade 10        D - r          -1.23             -0.7883
Grade 10        r_c - r        -3.81             -2.5847*
Grade 11        D - r_c         2.87              1.6173
Grade 11        D - r          -0.93             -0.5247
Grade 11        r_c - r        -3.79             -2.2406*
Grade 10/11     D - r_c         6.23              3.7202*
Grade 10/11     D - r          -0.61             -0.8808
Grade 10/11     r_c - r        -6.84             -3.9054*

* p < .05.


TABLE 5
Odd-Even and Kuder-Richardson 20 Coefficients of Reliability for the Tests,
at Each Grade Level, Constructed by Three Item Selection Techniques

                 Odd-Even Reliabilities      K-R 20 Reliabilities
Grade Level      D       r_c     r           D       r_c     r
Grade 10         .923    .895    .882        .904    .881    .889
Grade 11         .949    .939    .934        .925    .915    .919
Grade 10/11      .926    .929    .931        .917    .898    .934

Conclusions 


Three item validation methods were investigated in this study:
net D, Flanagan’s r, and Flanagan’s r_c (based on item analysis data
which has been corrected for chance success). The latter technique
had the added criterion of corrected indices of difficulty within the
range 0.15-0.75. These item validation techniques were used to
construct nine tests, one test by each method for each of three groups
(tenth-, eleventh-, and pooled tenth- and eleventh-grade). The
average overlap between tests of a grade level was 62%.

At each grade level, the r test mean was significantly higher than
the r_c test mean, with the D test mean falling in between. For the
tests at each grade level, the Kuder-Richardson 20 and the odd-even
coefficients of reliability for the tests did not differ significantly
from each other. The range in values was 0.88-0.95.

The conclusion to be drawn from the results of this study is that,
with regard to the test statistics considered, as an index of
discrimination to be used in item selection in test construction, net D
is as good as Flanagan’s r and Flanagan’s r_c. Because net D may be
obtained much more rapidly than either of the other two techniques, it
would appear that net D should be an appropriate index of
discrimination for classroom teachers to use, in conjunction with the
index of difficulty, in the selection of items for inclusion on a test.
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TEST RELIABILITY AND THE KUDER-RICHARDSON 
FORMULAS: 
DERIVATION FROM PROBABILITY THEORY 


DONALD W. ZIMMERMAN 
Carleton University 


The Kuder-Richardson formulas, the KR 20 and the KR 21, have
been used widely to estimate the reliability coefficient from a single 
administration of a test (Kuder and Richardson, 1937). The appeal 
of the idea of finding reliability from item statistics which are avail- 
able after one testing, together with the computational simplicity of 
the formulas, probably accounts for their popularity. 

Although a great deal of attention has been devoted over a
period of years to the estimation of reliability from item statistics
(Jackson and Ferguson, 1941; Guttman, 1945; Gulliksen, 1950;
Cronbach, 1951; Lyerly, 1958; Novick and Lewis, 1967, and others),
there are still gaps in the mathematical derivation of the
Kuder-Richardson results. The main purpose of this paper is to fill
some of these gaps, using language consistent with modern probability
theory (see, for example, Feller, 1968; Thomasian, 1969). This
approach, it is hoped, will also lead to a better understanding of
conditions under which the formulas are applicable to test data.

A test score is regarded as a sum of scores on items which are 
either “correct” or “incorrect.” Subtest scores which are continuous 
random variables can also be considered, and the structure of the 
model is essentially the same in that case. In the present paper the 
former case is stressed, since we are concerned with another sort of 
generality, in which assumptions about independence of item scores 
(“experimental independence”), correlations between certain scores, 
and random sampling procedures, are not made. Initially, there are
no restrictions of this kind on the probability distributions of item
scores; later, special cases which arise when restrictions are placed
on certain distributions are examined.

Derivation of these formulas directly from the axioms of prob- 
ability theory has not been explored, but there are several advan- 
tages to be gained by that approach. The main difficulty in deriva- 
tion of the Kuder-Richardson results has been the statement of nec- 
essary and sufficient conditions under which one or the other of the 
formulas equals a reliability coefficient. Conditions stated by Kuder 
and Richardson originally (1937) and by other investigators (Jack- 
son and Ferguson, 1941; Gulliksen, 1950; Novick and Lewis, 1967) 
are either sufficient or necessary and sufficient under a model in which 
item scores of each individual are independent. But if item scores 
are dependent, a situation which is quite plausible in test procedures, 
these conditions are neither necessary nor sufficient. 

The method in the present paper leads to conditions for both the 
KR 20 and the KR 21 formulas which apply in more general cases. 
It clarifies the situation in which independence of item scores, or 
“experimental independence,” is not assumed. Also, it leads to a sim- 
ple interpretation in terms of additivity in a matrix of probabilities, 
which is easily related to the notions of "ability" and “item diffi- 
culty.” 


Notation 


The symbols $\Omega$, $\Phi$, and $\Lambda$ denote sets, arbitrary
elements of which are $\alpha \in \Omega$, $\beta \in \Phi$, and
$\lambda \in \Lambda$. Probabilities are written $pr$ and $Pr$,
sometimes with subscripts. The functions $pr_1: \Omega \to R$ and
$pr: \Phi \times \Omega \to R$ are discrete probability densities
defined on a sample space $\Omega$ and on a product space
$\Phi \times \Omega$, respectively, while $Pr_1$ and $Pr$ are the
corresponding probability measures. The letter $P$, without
subscripts or superscripts, is reserved for a function defined on a
product set $\Lambda \times \Phi$, which is introduced in a later
section. Expectations, variances, and covariances are denoted by $E$,
$\sigma^2$, and COV.


Observed Score Random Variables 


A finite set, $\Phi = \{\beta_1, \beta_2, \ldots, \beta_K\}$,
represents individuals, and a sample space,
$\Omega = \{\alpha_1, \alpha_2, \ldots, \alpha_S\}$, represents
outcomes of a test procedure when applied to every member of $\Phi$.
This means that if measures were obtained on all individuals, any
composite outcome would correspond to exactly one point
$\alpha \in \Omega$. A discrete probability space $(\Omega, Pr_1)$,
with density $pr_1: \Omega \to R$, is constructed in accordance with
the usual axioms. The probability measure $Pr_1$ is defined on the
collection of all subsets of $\Omega$.

Consider first a collection of random variables, $X_\beta$,
$\beta \in \Phi$, defined on the sample space $\Omega$ and indexed by
the set $\Phi$. For each $\beta \in \Phi$ an observed score resulting
from a test procedure is a random variable, $X_\beta: \Omega \to R$.
The function $X_\beta$ induces a density on $R$, with
$pr_{X_\beta}(u) = Pr_1(X_\beta = u) = Pr_1\{\alpha: X_\beta(\alpha) = u\}$,
for each $u \in R$.

Consider also a function from the Cartesian product of $\Phi$ and
$\Omega$ into the real numbers, $X: \Phi \times \Omega \to R$, defined
by $X(\beta, \alpha) = X_\beta(\alpha)$, where
$pr(\beta, \alpha) = (1/K)\, pr_1(\alpha)$, for each
$(\beta, \alpha) \in \Phi \times \Omega$. The random variable $X$
induces a density on $R$, with
$pr_X(u) = Pr(X = u) = Pr\{(\beta, \alpha): X(\beta, \alpha) = u\}$,
for each $u \in R$. In this construction the $X_\beta$, which
represent observed scores of individuals, are defined on the same
sample space $\Omega$ and can be highly dependent. Since the value of
$X$ at $(\beta, \alpha)$ is the same as the value of $X_\beta$ at
$\alpha$, the random variable $X$ reflects total variation in scores,
or variation with respect to both individuals and outcomes.

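Expectations and variances can be defined as follows: For each
$\beta \in \Phi$,

$$EX_\beta = \sum_{r=1}^{S} X_\beta(\alpha_r)\, pr_1(\alpha_r)$$

and

$$\sigma^2 X_\beta = E(X_\beta - EX_\beta)^2 = EX_\beta^2 - (EX_\beta)^2.$$

Also,

$$EX = \sum_{j=1}^{K} \sum_{r=1}^{S} X(\beta_j, \alpha_r)\, pr(\beta_j, \alpha_r)$$

and

$$\sigma^2 X = E(X - EX)^2 = EX^2 - (EX)^2.$$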

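Consider now a function $EX^*: \Phi \to R$, which assigns to each
$\beta \in \Phi$ the expectation of the corresponding observed score
random variable, and a function $\sigma^2 X^*: \Phi \to R$, which
assigns to each $\beta \in \Phi$ the variance of the corresponding
observed score random variable. That is, the function $EX^*$ is defined
by $EX^*(\beta) = EX_\beta$ and the function $\sigma^2 X^*$ is defined
by $\sigma^2 X^*(\beta) = \sigma^2 X_\beta$. Means and variances of the
densities induced on $R$ by these functions are as follows:

$$
\begin{aligned}
E(EX^*) &= \frac{1}{K} \sum_{j=1}^{K} EX^*(\beta_j) \\
        &= \frac{1}{K} \sum_{j=1}^{K} \Big[ \sum_{r=1}^{S} X_{\beta_j}(\alpha_r)\, pr_1(\alpha_r) \Big] \\
        &= \sum_{j=1}^{K} \sum_{r=1}^{S} X(\beta_j, \alpha_r)\, pr(\beta_j, \alpha_r) \\
        &= EX.
\end{aligned}
$$

Similarly,

$$
\begin{aligned}
E(\sigma^2 X^*) &= \frac{1}{K} \sum_{j=1}^{K} \sigma^2 X^*(\beta_j) \\
                &= \frac{1}{K} \sum_{j=1}^{K} \big[ EX_{\beta_j}^2 - (EX_{\beta_j})^2 \big] \\
                &= \sum_{j=1}^{K} \sum_{r=1}^{S} [X(\beta_j, \alpha_r)]^2\, pr(\beta_j, \alpha_r) - \frac{1}{K} \sum_{j=1}^{K} [EX^*(\beta_j)]^2 \\
                &= EX^2 - E(EX^*)^2,
\end{aligned}
$$

and

$$
\begin{aligned}
\sigma^2(EX^*) &= E(EX^*)^2 - [E(EX^*)]^2 \\
               &= E(EX^*)^2 - (EX)^2.
\end{aligned}
$$

Consequently,

$$\sigma^2 X = E(\sigma^2 X^*) + \sigma^2(EX^*). \qquad (1)$$
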
To summarize, we have a discrete probability space $(\Omega, Pr_1)$, a
product space $(\Phi \times \Omega, Pr)$, an indexed collection of
random variables, $X_\beta$, $\beta \in \Phi$, defined on $\Omega$, a
random variable $X$, defined on $\Phi \times \Omega$, and functions
$EX^*$ and $\sigma^2 X^*$, defined on $\Phi$. Equation (1) gives a
relationship between $\sigma^2 X$, $EX^*$, and $\sigma^2 X^*$ which
holds in general, that is, for arbitrary probability spaces and for
arbitrary random variables $X_\beta$ whose expectations and variances
exist.

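
Equation (1) can be checked numerically on a small example. The Python sketch below is illustrative only: the score table and outcome probabilities are hypothetical, the function name is introduced for the example, and the final ratio anticipates the quantity $\sigma^2(EX^*)/\sigma^2 X$ that is used as a reliability coefficient in the next section.

```python
def variance_components(scores, pr1):
    """Return (total variance, mean of variances, variance of means).

    scores[j][r] is the observed score of individual j at outcome r, and
    pr1[r] is the probability of outcome r; every individual receives
    weight 1/K, as in the product density pr = (1/K) * pr1.
    """
    k = len(scores)
    means = [sum(x * p for x, p in zip(row, pr1)) for row in scores]
    variances = [
        sum(x * x * p for x, p in zip(row, pr1)) - m * m
        for row, m in zip(scores, means)
    ]
    ex = sum(means) / k
    ex2 = sum(x * x * p for row in scores for x, p in zip(row, pr1)) / k
    total_variance = ex2 - ex * ex
    mean_of_variances = sum(variances) / k
    variance_of_means = sum(m * m for m in means) / k - ex * ex
    return total_variance, mean_of_variances, variance_of_means


if __name__ == "__main__":
    scores = [[0, 1, 2], [1, 2, 3], [3, 3, 4]]  # hypothetical, K = 3, S = 3
    pr1 = [0.2, 0.5, 0.3]                        # hypothetical density
    total, within, between = variance_components(scores, pr1)
    print(abs(total - (within + between)) < 1e-12)  # Equation (1)
    print(between / total)                          # ratio of variances
```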
The Reliability Coefficient 

Let $X_\beta: \Omega \times \Omega \to R$ and
$X_\beta': \Omega \times \Omega \to R$ be two independent,
identically distributed random variables defined on the Cartesian
product of two copies of the original sample space $\Omega$, so that
for each $\beta \in \Phi$, $EX_\beta = EX_\beta'$ and
$\sigma^2 X_\beta = \sigma^2 X_\beta'$. Let
$X: \Phi \times \Omega \times \Omega \to R$, defined by
$X(\beta, \alpha, \alpha') = X_\beta(\alpha, \alpha')$, and
$X': \Phi \times \Omega \times \Omega \to R$, defined by
$X'(\beta, \alpha, \alpha') = X_\beta'(\alpha, \alpha')$, for each
$(\beta, \alpha, \alpha') \in \Phi \times \Omega \times \Omega$, be
identically distributed (but not necessarily independent) random
variables, with $EX = EX'$ and $\sigma^2 X = \sigma^2 X'$. That is,
from a probability space $(\Omega, Pr_1)$ we construct a product space
$(\Omega \times \Omega, Pr')$ with probabilities assigned by a product
rule, so that the coordinate variables $X_\beta$ and $X_\beta'$ are
independent, and in turn a product space
$(\Phi \times \Omega \times \Omega, Pr)$, as before.

A product-moment correlation can be obtained from the joint
density induced on $R^2$ by $(X, X')$. Since

$$\rho(X, X') = \frac{\mathrm{COV}(X, X')}{\sqrt{(\sigma^2 X)(\sigma^2 X')}}, \qquad (2)$$

where

$$
\begin{aligned}
\mathrm{COV}(X, X') &= E(XX') - (EX)(EX') \\
                    &= E[E(X_\beta X_\beta')] - [E(EX_\beta)][E(EX_\beta')] \\
                    &= E[(EX_\beta)(EX_\beta')] - [E(EX_\beta)][E(EX_\beta')] \\
                    &= \sigma^2(EX^*) = \sigma^2(EX'^*),
\end{aligned}
$$

we arrive at the result

$$\rho(X, X') = \frac{\sigma^2(EX^*)}{\sigma^2 X} = 1 - \frac{E(\sigma^2 X^*)}{\sigma^2 X}. \qquad (3)$$

Either (2) or (3) can be designated a reliability coefficient. To an
arbitrary probability space $(\Omega, Pr_1)$, together with an indexed
collection of random variables $X_\beta$, $\beta \in \Phi$, whose
expectations and variances exist, there corresponds in accordance with
the above definitions exactly one real number $\rho(X, X')$, with
$0 \le \rho(X, X') \le 1$.


Reliability of Composite Tests 


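Consider a finite set containing $N$ test items,
$\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_N\}$, a finite set
$\Phi$ containing $K$ individuals, as before, and a collection of
$NK$ subsets of $\Omega$, indexed by $\Lambda \times \Phi$. For
$\lambda \in \Lambda$, $\beta \in \Phi$, and
$(\lambda, \beta) \in \Lambda \times \Phi$, the subset
$A_{\lambda\beta}$ represents the outcomes of a test procedure where
individual $\beta$ obtains item $\lambda$ correct. Define a collection
of $NK$ indicators, $I_{\lambda\beta}: \Omega \to R$, by

$$
I_{\lambda\beta}(\alpha) =
\begin{cases}
1 & \text{if } \alpha \in A_{\lambda\beta}, \\
0 & \text{if } \alpha \notin A_{\lambda\beta},
\end{cases}
$$

for each $\alpha \in \Omega$. An individual's total test score is now
regarded as a sum of item scores,

$$X_\beta = \sum_{i=1}^{N} I_{\lambda_i \beta}$$

for each $\beta \in \Phi$.

Hence, $EI_{\lambda\beta} = Pr_1(I_{\lambda\beta} = 1) = Pr_1\{\alpha: I_{\lambda\beta}(\alpha) = 1\} = Pr_1(A_{\lambda\beta})$,
and $\sigma^2 I_{\lambda\beta} = EI_{\lambda\beta}(1 - EI_{\lambda\beta})$.
Define also a collection of $N$ indicators,
$I_\lambda: \Phi \times \Omega \to R$, by
$I_\lambda(\beta, \alpha) = I_{\lambda\beta}(\alpha)$, for each
$(\beta, \alpha) \in \Phi \times \Omega$ and for each
$\lambda \in \Lambda$. According to the same derivation in the last
section, $\sigma^2 I_\lambda = E(\sigma^2 I_\lambda^*) + \sigma^2(EI_\lambda^*)$,
where $\sigma^2 I_\lambda^*$ and $EI_\lambda^*$ are defined as
$\sigma^2 X^*$ and $EX^*$ were defined above.

Consider now a function $P: \Lambda \times \Phi \to R$, defined by
$P(\lambda, \beta) = EI_{\lambda\beta} = Pr_1(I_{\lambda\beta} = 1)$;
that is, $P$ assigns to each $(\lambda, \beta) \in \Lambda \times \Phi$
the probability that individual $\beta$ obtains item $\lambda$ correct.
Consider also a collection of functions $P_\beta: \Lambda \to R$,
defined by $P_\beta(\lambda) = P(\lambda, \beta) = EI_{\lambda\beta}$,
for each $\lambda \in \Lambda$, and a collection of functions
$P_\lambda: \Phi \to R$, defined by
$P_\lambda(\beta) = P(\lambda, \beta) = EI_{\lambda\beta} = EI_\lambda^*(\beta)$,
for each $\beta \in \Phi$. The following symbols will refer to means
and variances: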

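For each $\beta \in \Phi$,

$$EP_\beta = \frac{1}{N} \sum_{i=1}^{N} P_\beta(\lambda_i) = \frac{EX_\beta}{N},$$

and

$$\sigma^2 P_\beta = \frac{1}{N} \sum_{i=1}^{N} [P_\beta(\lambda_i)]^2 - (EP_\beta)^2.$$

For each $\lambda \in \Lambda$,

$$EP_\lambda = \frac{1}{K} \sum_{j=1}^{K} P_\lambda(\beta_j),$$

and

$$\sigma^2 P_\lambda = \frac{1}{K} \sum_{j=1}^{K} [P_\lambda(\beta_j)]^2 - (EP_\lambda)^2.$$

Also,

$$EP = \frac{1}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} P(\lambda_i, \beta_j) = \frac{EX}{N}$$

and

$$\sigma^2 P = \frac{1}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} [P(\lambda_i, \beta_j)]^2 - (EP)^2.$$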


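Let $EP_\beta^*: \Phi \to R$ be a function which assigns to each
$\beta \in \Phi$ the mean of $P_\beta$ and
$\sigma^2 P_\beta^*: \Phi \to R$ a function which assigns to each
$\beta \in \Phi$ the variance of $P_\beta$. That is,
$EP_\beta^*(\beta) = EP_\beta$, for each $\beta \in \Phi$, and
similarly for $\sigma^2 P_\beta^*$. Functions
$EP_\lambda^*: \Lambda \to R$ and $\sigma^2 P_\lambda^*: \Lambda \to R$
are defined symmetrically. The variance of $EP_\beta^*$ and the mean
of $\sigma^2 P_\beta^*$ can be defined:

$$\sigma^2(EP_\beta^*) = \frac{1}{K} \sum_{j=1}^{K} [EP_\beta^*(\beta_j)]^2 - (EP)^2$$

and

$$E(\sigma^2 P_\beta^*) = \frac{1}{NK} \sum_{i=1}^{N} \sum_{j=1}^{K} [P(\lambda_i, \beta_j)]^2 - \frac{1}{K} \sum_{j=1}^{K} (EP_{\beta_j})^2.$$

It follows that

$$\sigma^2 P = E(\sigma^2 P_\beta^*) + \sigma^2(EP_\beta^*)$$

and from expressions symmetrical in the $\lambda$ and $\beta$ subscripts that

$$\sigma^2 P = E(\sigma^2 P_\lambda^*) + \sigma^2(EP_\lambda^*).$$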

To summarize the interpretations of the above expressions:
$I_{\lambda\beta}$ is an individual's item score, where the subscript
$\lambda$ refers to items and the subscript $\beta$ refers to
individuals, the random variable $X_\beta$ is an individual's total
test score, and the function $P$ assigns to each
$(\lambda, \beta) \in \Lambda \times \Phi$ the probability that
individual $\beta$ obtains item $\lambda$ correct, which is the same as
the expectation of $I_{\lambda\beta}$. The functions $P_\beta$ and
$P_\lambda$ assign probabilities to each $\lambda \in \Lambda$, for
fixed $\beta \in \Phi$, and to each $\beta$ for fixed $\lambda$,
respectively. The functions $EP_\beta^*$ and $\sigma^2 P_\beta^*$
assign means and variances of $P_\beta$ to each $\beta \in \Phi$, and
the functions $EP_\lambda^*$ and $\sigma^2 P_\lambda^*$ assign means
and variances of $P_\lambda$ to each $\lambda \in \Lambda$. In turn,
$EP_\beta^*$ and $EP_\lambda^*$ have variances $\sigma^2(EP_\beta^*)$
and $\sigma^2(EP_\lambda^*)$, and $\sigma^2 P_\beta^*$ and
$\sigma^2 P_\lambda^*$ have means $E(\sigma^2 P_\beta^*)$ and
$E(\sigma^2 P_\lambda^*)$.

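Since, for each $\beta \in \Phi$,

$$\sigma^2 X_\beta = \sum_{i=1}^{N} \sigma^2 I_{\lambda_i\beta} + \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta}, I_{\lambda_m\beta}),$$

where $\lambda_l, \lambda_m \in \Lambda$,

$$E(\sigma^2 X^*) = \frac{1}{K} \left[ \sum_{j=1}^{K} \sum_{i=1}^{N} EI_{\lambda_i\beta_j} - \sum_{j=1}^{K} \sum_{i=1}^{N} (EI_{\lambda_i\beta_j})^2 \right] + \frac{1}{K} \sum_{j=1}^{K} \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta_j}, I_{\lambda_m\beta_j})$$

$$= N[EP(1 - EP) - \sigma^2 P] + \frac{1}{K} \sum_{j=1}^{K} \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta_j}, I_{\lambda_m\beta_j}). \tag{4}$$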
A general expression for the reliability coefficient can be found by
substituting (4) in (3). To an arbitrary probability space
$(\Omega, \Pr_1)$, together with an indexed collection of indicators,
$I_{\lambda\beta}$, $(\lambda, \beta) \in \Lambda \times \Phi$, and their sums
$X_\beta$, $\beta \in \Phi$, there corresponds, according to these
constructions, exactly one real number, $\rho(X, X')$, with
$0 \leq \rho(X, X') \leq 1$. But in order to obtain the Kuder-Richardson
formulas, it is necessary to introduce restrictions, so that the
covariance term in equation (4) vanishes.
Derivation of Kuder-Richardson Formulas 
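Associate a sample space $\Omega_\lambda$ with each $\lambda \in \Lambda$, and
let $\Omega = \Omega_{\lambda_1} \times \Omega_{\lambda_2} \times \cdots \times \Omega_{\lambda_N}$,
where
$pr_1(\alpha_1, \alpha_2, \ldots, \alpha_N) = pr_1^{\lambda_1}(\alpha_1)\, pr_1^{\lambda_2}(\alpha_2) \cdots pr_1^{\lambda_N}(\alpha_N)$,
for each $(\alpha_1, \alpha_2, \ldots, \alpha_N) \in \Omega$. Define
$I_{\lambda\beta}: \Omega_\lambda \to R$, the indicators of subsets
$A_{\lambda\beta}$ of each $\Omega_\lambda$, and define $X_\beta$ by
$X_\beta = \sum_{i=1}^{N} I_{\lambda_i\beta}$, as before.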


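It follows that, for each $\beta \in \Phi$, the $N$ indicators
$I_{\lambda\beta}$ are independent, while no restrictions are placed on the
$K$ indicators $I_{\lambda\beta}$ corresponding to a given
$\lambda \in \Lambda$. This means that, for each $\beta \in \Phi$,
$\Pr_1(A_{\lambda_l\beta}) \Pr_1(A_{\lambda_m\beta}) = \Pr_1(A_{\lambda_l\beta} \cap A_{\lambda_m\beta})$,
for all $\lambda_l, \lambda_m \in \Lambda$. But for a given
$\lambda \in \Lambda$,
$\Pr_1(A_{\lambda\beta_1}) \Pr_1(A_{\lambda\beta_2})$ does not necessarily
equal $\Pr_1(A_{\lambda\beta_1} \cap A_{\lambda\beta_2})$, for
$\beta_1, \beta_2 \in \Phi$.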

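Hence, for each $\beta \in \Phi$,
$\operatorname{cov}(I_{\lambda_l\beta}, I_{\lambda_m\beta}) = 0$, for all
$\lambda_l, \lambda_m \in \Lambda$,

$$\sigma^2 X_\beta = \sum_{i=1}^{N} [EI_{\lambda_i\beta}(1 - EI_{\lambda_i\beta})],$$

and

$$E(\sigma^2 X^*) = N[EP(1 - EP) - \sigma^2 P] = \frac{EX(N - EX)}{N} - N E(\sigma^2 P_\beta^*) - N \sigma^2(EP_\beta^*).$$

Therefore,

$$\rho(X, X') = 1 - \frac{EX(N - EX)}{N \sigma^2 X} + \frac{N E(\sigma^2 P_\beta^*)}{\sigma^2 X} + \frac{N \sigma^2(EP_\beta^*)}{\sigma^2 X}.$$

Provided $N \geq 2$, substituting $\rho(X, X')/N$ for
$[N \sigma^2(EP_\beta^*)/\sigma^2 X]$ and solving for $\rho(X, X')$, yields

$$\rho(X, X') = \frac{N}{N - 1} \left[ 1 - \frac{EX(N - EX) - N^2 E(\sigma^2 P_\beta^*)}{N \sigma^2 X} \right] \tag{5}$$

for $N = 2, 3, \ldots$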


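Since, for each $\lambda \in \Lambda$,

$$\sigma^2 I_\lambda = E(\sigma^2 I_\lambda^*) + \sigma^2(EI_\lambda^*) = E(\sigma^2 I_\lambda^*) + \sigma^2 P_\lambda,$$

it follows that

$$E(\sigma^2 X^*) = E\left( \sum_{i=1}^{N} \sigma^2 I_{\lambda_i}^* \right) = \sum_{i=1}^{N} [E(\sigma^2 I_{\lambda_i}^*)] = \sum_{i=1}^{N} \sigma^2 I_{\lambda_i} - N E(\sigma^2 P_\lambda^*),$$

and since

$$E(\sigma^2 P_\lambda^*) = E(\sigma^2 P_\beta^*) + \sigma^2(EP_\beta^*) - \sigma^2(EP_\lambda^*),$$

it follows that

$$E(\sigma^2 X^*) = \sum_{i=1}^{N} \sigma^2 I_{\lambda_i} - N E(\sigma^2 P_\beta^*) - N \sigma^2(EP_\beta^*) + N \sigma^2(EP_\lambda^*).$$

Consequently,

$$\rho(X, X') = \frac{N}{N - 1} \left[ 1 - \frac{\sum_{i=1}^{N} \sigma^2 I_{\lambda_i}}{\sigma^2 X} \right] + \frac{N^2 [E(\sigma^2 P_\beta^*) - \sigma^2(EP_\lambda^*)]}{(N - 1) \sigma^2 X} \tag{6}$$

for $N = 2, 3, \ldots$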
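From (6) it is apparent that

$$\rho(X, X') = \frac{N}{N - 1} \left[ 1 - \frac{\sum_{i=1}^{N} \sigma^2 I_{\lambda_i}}{\sigma^2 X} \right],$$

where $\sigma^2 I_\lambda = \Pr(I_\lambda = 1)[1 - \Pr(I_\lambda = 1)]$, that is,
that the reliability coefficient equals the KR 20, if and only if
$E(\sigma^2 P_\beta^*) = \sigma^2(EP_\lambda^*)$, or equivalently, if and only
if $E(\sigma^2 P_\lambda^*) = \sigma^2(EP_\beta^*)$. From (5) it follows that

$$\rho(X, X') = \frac{N}{N - 1} \left[ 1 - \frac{EX(N - EX)}{N \sigma^2 X} \right],$$

that is, that the reliability coefficient equals the KR 21, if and
only if $E(\sigma^2 P_\beta^*) = 0$.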
These equalities can be interpreted in terms of additivity in the
$N \times K$ matrix of probabilities, $P(\lambda, \beta)$. The equality
$E(\sigma^2 P_\beta^*) = \sigma^2(EP_\lambda^*)$ holds if there exist functions
$F_1: \Lambda \to R$, $F_2: \Phi \to R$, $F_1^*: \Lambda \times \Phi \to R$,
and $F_2^*: \Lambda \times \Phi \to R$, where
$F_1^*(\lambda, \beta) = F_1(\lambda)$ and where
$F_2^*(\lambda, \beta) = F_2(\beta)$, for each
$(\lambda, \beta) \in \Lambda \times \Phi$, such that $P = F_1^* + F_2^*$.
That is, $P(\lambda, \beta) = F_1^*(\lambda, \beta) + F_2^*(\lambda, \beta)$,
for each $(\lambda, \beta) \in \Lambda \times \Phi$. Otherwise expressed, the
functions $P_\beta: \Lambda \to R$ differ by constants; for all
$\beta, \beta' \in \Phi$, $P_\beta(\lambda) = P_{\beta'}(\lambda) + k$, for some
constant $k$. Equivalently, the functions $P_\lambda: \Phi \to R$ differ by
constants; for all $\lambda, \lambda' \in \Lambda$,
$P_\lambda(\beta) = P_{\lambda'}(\beta) + c$, for some constant $c$. The
equality $E(\sigma^2 P_\beta^*) = 0$ holds if in addition each function
$P_\beta: \Lambda \to R$ is a constant function, or equivalently if
$P_{\lambda_1} = P_{\lambda_2} = \cdots = P_{\lambda_N}$. This means that for
each individual $\beta$ the total test score $X_\beta$ is a sum of
$N$ independent, identically distributed indicators.
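The conditions above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes that each individual's item scores are independent indicators (so the covariance term in equation (4) is zero) and computes the reliability coefficient, the KR 20, and the KR 21 directly from a matrix of probabilities. The function name `reliability_summary` and the matrix layout (rows are items, columns are individuals) are illustrative choices, and the third matrix is a hypothetical non-additive case.

```python
from fractions import Fraction as F

def reliability_summary(P):
    """Return (rho, KR 20, KR 21) for P[i][j] = P(lambda_i, beta_j).

    Assumption: for each individual the N item scores are independent
    indicators, so sigma^2 X_beta is the sum of P(1 - P) over items.
    Entries should be Fractions; the total-score variance must be nonzero.
    """
    N, K = len(P), len(P[0])
    # Mean and variance of each individual's total score X_beta.
    EX_b = [sum(P[i][j] for i in range(N)) for j in range(K)]
    VX_b = [sum(P[i][j] * (1 - P[i][j]) for i in range(N)) for j in range(K)]
    EX = sum(EX_b) / K
    E_var = sum(VX_b) / K                           # E(sigma^2 X*)
    var_E = sum(m * m for m in EX_b) / K - EX * EX  # sigma^2(EX*)
    VX = E_var + var_E                              # sigma^2 X
    rho = var_E / VX                                # ratio of variances
    # Item variances over all individuals: EI(1 - EI).
    EI = [sum(row) / K for row in P]
    item_var = sum(p * (1 - p) for p in EI)
    kr20 = F(N, N - 1) * (1 - item_var / VX)
    kr21 = F(N, N - 1) * (1 - EX * (N - EX) / (N * VX))
    return rho, kr20, kr21

# Additive matrix (item difficulty plus ability): KR 20 equals rho.
additive = [[F(0), F(1, 2)], [F(1, 2), F(1)]]
# Additive with constant item difficulty: KR 20 and KR 21 both equal rho.
constant_items = [[F(0), F(1, 2)], [F(0), F(1, 2)]]
# Hypothetical non-additive matrix: neither formula equals rho.
non_additive = [[F(1, 4), F(3, 4)], [F(3, 4), F(1, 4)]]

for name, P in [("additive", additive), ("constant items", constant_items),
                ("non-additive", non_additive)]:
    print(name, reliability_summary(P))
```

Exact rational arithmetic is used so that equality between a formula and the reliability coefficient can be tested without rounding error.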


Examples 


Tables 1 and 2 illustrate simple test procedures, where a sample
space $\Omega$ contains only four points. For convenience we write
$\Omega = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$,
$\Phi = \{\beta_1, \beta_2\}$, and $\Lambda = \{\lambda_1, \lambda_2\}$. The
procedure, which consists of each of two individuals responding to each
of two test items, can have four possible outcomes. The tables show the
function $P$, the values of the random variables $I_{\lambda\beta}$ and
$X_\beta$ at each $\alpha \in \Omega$, the values of the random variables
$I_\lambda$ and $X$ at each $(\beta, \alpha) \in \Phi \times \Omega$, the
induced densities, and the means and variances needed to calculate
$\rho(X, X')$, using equation (3).

The values of each $\sigma^2 I_\lambda$ and of $\sigma^2 X$ can be substituted
in the KR 20 formula, and the values of $EX$ and $\sigma^2 X$ can be
substituted in the KR 21 formula. In Table 1 the KR 20 formula equals
$\rho(X, X')$, but the KR 21 does not; in Table 2 both formulas equal
$\rho(X, X')$. In these tables the random variables
$I_{\lambda_1\beta_1}$ and $I_{\lambda_2\beta_1}$ are independent, and the
random variables $I_{\lambda_1\beta_2}$ and $I_{\lambda_2\beta_2}$ are
independent. In each case, therefore, the values of the function $P$ at
each $(\lambda, \beta) \in \Lambda \times \Phi$ determine whether or not
either the KR 20 formula or the KR 21 formula equals $\rho(X, X')$.


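Continuous Random Variables

If $X_\beta: \Omega \to R$ is a continuous random variable, we write
$P_{X_\beta}(U) = \Pr_1(X_\beta \in U) = \Pr_1\{\alpha : X_\beta(\alpha) \in U\}$,
for each interval $U \subset R$, such that $X_\beta^{-1}(U)$ is an event in
the sample space $\Omega$. If subtest scores are continuous random
variables, $I_{\lambda\beta}: \Omega \to R$, and if
$X_\beta = \sum_{i=1}^{N} I_{\lambda_i\beta}$, for each $\beta \in \Phi$, we
define a function $EI^{**}: \Lambda \times \Phi \to R$ by
$EI^{**}(\lambda, \beta) = EI_{\lambda\beta}$, for each
$(\lambda, \beta) \in \Lambda \times \Phi$. For this case derivation of the
KR 20 formula is essentially unchanged, but the KR 21 formula applies
only if the $I_{\lambda\beta}$ are indicators.

In the continuous case there is no restriction on the range of the
random variables $I_{\lambda\beta}$, but other restrictions as to
independence and values of the function $EI^{**}$ in the $N \times K$ matrix
remain as in the above derivation. Although finite cases have been
stressed in the present paper, models having essentially the same
structure also apply to denumerable and nondenumerable sample spaces
$\Omega$, as well as to denumerable and nondenumerable $\Phi$.


TABLE 1
Reliability Determined from Sample Space, KR 20 Applies

$\Omega = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$, $\Phi = \{\beta_1, \beta_2\}$, $\Lambda = \{\lambda_1, \lambda_2\}$

$(\lambda_i, \beta_j)$     $A_{\lambda_i\beta_j}$        $P(\lambda_i, \beta_j)$
$(\lambda_1, \beta_1)$     $\emptyset$                   0
$(\lambda_1, \beta_2)$     $\{\alpha_1, \alpha_2\}$      1/2
$(\lambda_2, \beta_1)$     $\{\alpha_1, \alpha_3\}$      1/2
$(\lambda_2, \beta_2)$     $\Omega$                      1

$\alpha_r$    $pr_1(\alpha_r)$   $I_{\lambda_1\beta_1}$   $I_{\lambda_1\beta_2}$   $I_{\lambda_2\beta_1}$   $I_{\lambda_2\beta_2}$   $X_{\beta_1}$   $X_{\beta_2}$
$\alpha_1$    1/4                0                        1                        1                        1                        1               2
$\alpha_2$    1/4                0                        1                        0                        1                        0               2
$\alpha_3$    1/4                0                        0                        1                        1                        1               1
$\alpha_4$    1/4                0                        0                        0                        1                        0               1

$(\beta_j, \alpha_r)$     $pr(\beta_j, \alpha_r)$   $I_{\lambda_1}$   $I_{\lambda_2}$   $X$
$(\beta_1, \alpha_1)$     1/8                       0                 1                 1
$(\beta_1, \alpha_2)$     1/8                       0                 0                 0
$(\beta_1, \alpha_3)$     1/8                       0                 1                 1
$(\beta_1, \alpha_4)$     1/8                       0                 0                 0
$(\beta_2, \alpha_1)$     1/8                       1                 1                 2
$(\beta_2, \alpha_2)$     1/8                       1                 1                 2
$(\beta_2, \alpha_3)$     1/8                       0                 1                 1
$(\beta_2, \alpha_4)$     1/8                       0                 1                 1

$u$    $pr_{I_{\lambda_1}}(u)$   $pr_{I_{\lambda_2}}(u)$   $pr_{X_{\beta_1}}(u)$   $pr_{X_{\beta_2}}(u)$   $pr_X(u)$
0      3/4                       1/4                       1/2                     0                       1/4
1      1/4                       3/4                       1/2                     1/2                     1/2
2      0                         0                         0                       1/2                     1/4

$EX_{\beta_1} = 1/2$    $\sigma^2 X_{\beta_1} = 1/4$    $EI_{\lambda_1} = 1/4$    $\sigma^2 I_{\lambda_1} = 3/16$
$EX_{\beta_2} = 3/2$    $\sigma^2 X_{\beta_2} = 1/4$    $EI_{\lambda_2} = 3/4$    $\sigma^2 I_{\lambda_2} = 3/16$
$EX = 1$    $\sigma^2 X = 1/2$
$E(\sigma^2 X^*) = 1/4$
$\sigma^2(EX^*) = 1/4$
$\rho(X, X') = 1/2$
KR 20 = 1/2 = $\rho(X, X')$
KR 21 = 0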


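TABLE 2
Reliability Determined from Sample Space, KR 20 and KR 21 Apply

$\Omega = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$, $\Phi = \{\beta_1, \beta_2\}$, $\Lambda = \{\lambda_1, \lambda_2\}$

$(\lambda_i, \beta_j)$     $A_{\lambda_i\beta_j}$        $P(\lambda_i, \beta_j)$
$(\lambda_1, \beta_1)$     $\emptyset$                   0
$(\lambda_1, \beta_2)$     $\{\alpha_3, \alpha_4\}$      1/2
$(\lambda_2, \beta_1)$     $\emptyset$                   0
$(\lambda_2, \beta_2)$     $\{\alpha_2, \alpha_4\}$      1/2

$\alpha_r$    $pr_1(\alpha_r)$   $I_{\lambda_1\beta_1}$   $I_{\lambda_1\beta_2}$   $I_{\lambda_2\beta_1}$   $I_{\lambda_2\beta_2}$   $X_{\beta_1}$   $X_{\beta_2}$
$\alpha_1$    1/4                0                        0                        0                        0                        0               0
$\alpha_2$    1/4                0                        0                        0                        1                        0               1
$\alpha_3$    1/4                0                        1                        0                        0                        0               1
$\alpha_4$    1/4                0                        1                        0                        1                        0               2

$(\beta_j, \alpha_r)$     $pr(\beta_j, \alpha_r)$   $I_{\lambda_1}$   $I_{\lambda_2}$   $X$
$(\beta_1, \alpha_1)$     1/8                       0                 0                 0
$(\beta_1, \alpha_2)$     1/8                       0                 0                 0
$(\beta_1, \alpha_3)$     1/8                       0                 0                 0
$(\beta_1, \alpha_4)$     1/8                       0                 0                 0
$(\beta_2, \alpha_1)$     1/8                       0                 0                 0
$(\beta_2, \alpha_2)$     1/8                       0                 1                 1
$(\beta_2, \alpha_3)$     1/8                       1                 0                 1
$(\beta_2, \alpha_4)$     1/8                       1                 1                 2

$u$    $pr_{I_{\lambda_1}}(u)$   $pr_{I_{\lambda_2}}(u)$   $pr_{X_{\beta_1}}(u)$   $pr_{X_{\beta_2}}(u)$   $pr_X(u)$
0      3/4                       3/4                       1                       1/4                     5/8
1      1/4                       1/4                       0                       1/2                     1/4
2      0                         0                         0                       1/4                     1/8

$EX_{\beta_1} = 0$      $\sigma^2 X_{\beta_1} = 0$      $EI_{\lambda_1} = 1/4$    $\sigma^2 I_{\lambda_1} = 3/16$
$EX_{\beta_2} = 1$      $\sigma^2 X_{\beta_2} = 1/2$    $EI_{\lambda_2} = 1/4$    $\sigma^2 I_{\lambda_2} = 3/16$
$EX = 1/2$    $\sigma^2 X = 1/2$
$E(\sigma^2 X^*) = 1/4$
$\sigma^2(EX^*) = 1/4$
$\rho(X, X') = 1/2$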


KR 20 = 1/2 = $\rho(X, X')$
KR 21 = 1/2 = $\rho(X, X')$


Discussion

There are several alternative ways of interpreting the above
results and of stating conditions under which the KR 20 and KR 21
formulas equal the reliability coefficient. In deriving the formulas
two kinds of restrictions were made. One provided for independence
of individuals' item scores $I_{\lambda\beta}$, and another was related to
the values of the function $P$ at each
$(\lambda, \beta) \in \Lambda \times \Phi$.

The function $F_1: \Lambda \to R$ can be regarded as an assignment of
"difficulty" to items, and the function $F_2: \Phi \to R$ as an assignment
of "ability" to individuals. If for each individual $\beta$ the item
scores $I_{\lambda\beta}$ are independent, and if, further, ability and item
difficulty are additive, the KR 20 formula equals the reliability
coefficient. If for each individual the item scores are independent, if
ability and item difficulty are additive, and if item difficulty is a
constant, the KR 21 formula equals the reliability coefficient. For the
KR 21 it is not sufficient that $EI_\lambda$, the mean of item scores over
individuals, is the same for all $\lambda$. The above condition implies
that for each $\beta$, $P(\lambda, \beta)$ is the same for all $\lambda$, in
other words, that every function $P_\beta: \Lambda \to R$ is a constant
function.

Another way of stating these conditions is as follows. The KR 20
formula applies if for each individual the scores on the $N$ items of
a test are independent indicators ("Poisson trials," or generalized
Bernoulli trials), with restrictions on the probabilities
$P(\lambda, \beta)$ as given above. The KR 21 formula applies if for each
individual the scores on the $N$ items of a test are independent,
identically distributed indicators (Bernoulli trials).

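The additivity conditions are both necessary and sufficient under
a model in which each individual's item scores are independent,
which is true if
$\Omega = \Omega_{\lambda_1} \times \Omega_{\lambda_2} \times \cdots \times \Omega_{\lambda_N}$, if

$$pr_1(\alpha_1, \alpha_2, \ldots, \alpha_N) = pr_1^{\lambda_1}(\alpha_1)\, pr_1^{\lambda_2}(\alpha_2) \cdots pr_1^{\lambda_N}(\alpha_N),$$

for each $(\alpha_1, \alpha_2, \ldots, \alpha_N) \in \Omega$, and if the
indicator $I_{\lambda\beta}$ is defined on $\Omega$ as above. More generally,
however, the KR 20 formula equals the reliability coefficient if and
only if

$$N[E(\sigma^2 P_\beta^*) - \sigma^2(EP_\lambda^*)] = \frac{1}{K} \sum_{j=1}^{K} \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta_j}, I_{\lambda_m\beta_j}),$$

and the KR 21 formula equals the reliability coefficient if and only
if

$$N E(\sigma^2 P_\beta^*) = \frac{1}{K} \sum_{j=1}^{K} \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta_j}, I_{\lambda_m\beta_j}).$$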
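It is possible to construct sample spaces in which

$$E(\sigma^2 P_\beta^*) - \sigma^2(EP_\lambda^*) = 0,$$

or in which $E(\sigma^2 P_\beta^*) = 0$, even though
$\frac{1}{K} \sum_{j} \sum_{l} \sum_{m \neq l} \operatorname{cov}(I_{\lambda_l\beta_j}, I_{\lambda_m\beta_j}) \neq 0$.
For example, in the space
$\Omega = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$, where
$\Phi = \{\beta_1, \beta_2\}$ and $\Lambda = \{\lambda_1, \lambda_2\}$, where
$pr_1(\alpha) = 1/4$, for each $\alpha \in \Omega$, and where
Axe = 0, Ans, = (а, ов}, Arne, = 9, and Ав, = (оз, оз}, so that
$E(\sigma^2 P_\beta^*) = 0$, the additivity conditions are satisfied, but
neither the KR 20 nor the KR 21 formula equals $\rho(X, X')$, since

$$\Pr_1(A_{\lambda_1\beta_2}) \Pr_1(A_{\lambda_2\beta_2}) \neq \Pr_1(A_{\lambda_1\beta_2} \cap A_{\lambda_2\beta_2}).$$

Also, it is possible to construct sample spaces in which
$E(\sigma^2 P_\beta^*) - \sigma^2(EP_\lambda^*) \neq 0$, or in which
$E(\sigma^2 P_\beta^*) \neq 0$, although the equality containing the
covariance term is satisfied. For example, in the space
$\Omega = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$, where
$\Phi = \{\beta_1, \beta_2\}$ and $\Lambda = \{\lambda_1, \lambda_2\}$, where
$pr_1(\alpha) = 1/4$, for each $\alpha \in \Omega$, and where
As, = {a3}, Aus, = 0, Ама, = (ол, аз, оз}, and Ал, = 0, the KR 21
formula equals $\rho(X, X')$, even though $P(\lambda_1, \beta_1)$ = +,
$P(\lambda_1, \beta_2)$ = 4, $P(\lambda_2, \beta_1)$ = i, and
$P(\lambda_2, \beta_2)$ = 3.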

The additivity conditions on the function $P$ can also be expressed
in terms of item scores and total test scores. That is, the KR 20
formula applies if $P(\lambda, \beta) = EI_\lambda + EX_\beta/N - EX/N$, for
all $(\lambda, \beta) \in \Lambda \times \Phi$, and the KR 21 formula applies
if $P(\lambda, \beta) = EX_\beta/N$, for all
$(\lambda, \beta) \in \Lambda \times \Phi$.
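The two conditions on $P$ stated in terms of item and total scores can be tested directly on a matrix of probabilities. The sketch below is illustrative only: the function names are hypothetical, rows index items and columns index individuals, and the conditions are meaningful for the KR formulas only under the independence assumption discussed above.

```python
from fractions import Fraction as F

def additive_prediction(P):
    """Matrix with entries EI_lambda + EX_beta/N - EX/N for each cell of P."""
    N, K = len(P), len(P[0])
    EI = [sum(row) / K for row in P]                                  # EI_lambda
    EXb_over_N = [sum(P[i][j] for i in range(N)) / N for j in range(K)]  # EX_beta / N
    EX_over_N = sum(EI) / N                                           # EX / N
    return [[EI[i] + EXb_over_N[j] - EX_over_N for j in range(K)]
            for i in range(N)]

def satisfies_kr20_condition(P):
    """True if P(lambda, beta) = EI_lambda + EX_beta/N - EX/N in every cell."""
    return P == additive_prediction(P)

def satisfies_kr21_condition(P):
    """True if P(lambda, beta) = EX_beta/N in every cell."""
    N, K = len(P), len(P[0])
    col_mean = [sum(P[i][j] for i in range(N)) / N for j in range(K)]
    return all(P[i][j] == col_mean[j] for i in range(N) for j in range(K))

additive = [[F(0), F(1, 2)], [F(1, 2), F(1)]]            # KR 20 condition only
constant_items = [[F(0), F(1, 2)], [F(0), F(1, 2)]]      # both conditions
non_additive = [[F(1, 4), F(3, 4)], [F(3, 4), F(1, 4)]]  # hypothetical, neither

for P in (additive, constant_items, non_additive):
    print(satisfies_kr20_condition(P), satisfies_kr21_condition(P))
```

Fractions are used so that the cell-by-cell equality tests are exact.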

Summary 


The Kuder-Richardson formulas 20 and 21 were derived from a
model based on probability theory, in which all assumptions required
for equality of one or the other of the formulas and the reliability
coefficient of a test were made explicit. Scores on composite tests
were described by a sample space $\Omega$, a set $\Phi$ representing
individuals, and a set $\Lambda$ representing items. A collection of random
variables was defined on $\Omega$ and indexed by $\Phi$. For each
$\beta \in \Phi$ a random variable $X_\beta: \Omega \to R$ represented an
individual's observed score. A random variable
$X: \Phi \times \Omega \to R$, defined by $X(\beta, \alpha) = X_\beta(\alpha)$,
for each $(\beta, \alpha) \in \Phi \times \Omega$, represented variability in
scores over all individuals. Expectations and variances $EX_\beta$, $EX$,
$\sigma^2 X_\beta$, and $\sigma^2 X$ were defined. A function
$EX^*: \Phi \to R$ was defined by $EX^*(\beta) = EX_\beta$, for each
$\beta \in \Phi$, and a function $\sigma^2 X^*: \Phi \to R$ was defined by
$\sigma^2 X^*(\beta) = \sigma^2 X_\beta$.

Independent, identically distributed random variables $X_\beta$ and
$X_\beta'$ were defined on the Cartesian product of two copies of the
original sample space $\Omega$. Identically distributed random variables
$X$ and $X'$ were defined on $\Phi \times \Omega \times \Omega$. It was proved
that the correlation between $X$ and $X'$, which is regarded as a
reliability coefficient, equals the ratio of the variance of $EX^*$ to
the variance of $X$.

Given a set of test items $\Lambda$, an individual's total score was
defined as a sum of $N$ indicators,
$X_\beta = \sum_{i=1}^{N} I_{\lambda_i\beta}$, $\lambda_i \in \Lambda$. A
function $P: \Lambda \times \Phi \to R$ was defined by
$P(\lambda, \beta) = EI_{\lambda\beta} = \Pr_1(I_{\lambda\beta} = 1)$. The
reliability coefficient of a composite test was expressed in terms of
means and variances of item scores and was related to the function $P$
and to the partial functions $P_\lambda: \Phi \to R$, defined by
$P_\lambda(\beta) = P(\lambda, \beta)$, for each $\beta \in \Phi$, and
$P_\beta: \Lambda \to R$, defined by $P_\beta(\lambda) = P(\lambda, \beta)$,
for each $\lambda \in \Lambda$.

If $\Omega = \Omega_{\lambda_1} \times \Omega_{\lambda_2} \times \cdots \times \Omega_{\lambda_N}$,
where
$pr_1(\alpha_1, \alpha_2, \ldots, \alpha_N) = pr_1^{\lambda_1}(\alpha_1)\, pr_1^{\lambda_2}(\alpha_2) \cdots pr_1^{\lambda_N}(\alpha_N)$,
for each $(\alpha_1, \alpha_2, \ldots, \alpha_N) \in \Omega$, the indicators
$I_{\lambda\beta}$, defined as coordinate variables on $\Omega$, are
independent, and the KR 20 and KR 21 formulas can be derived by placing
further restrictions on the probabilities $P(\lambda, \beta)$. A necessary
and sufficient condition for the KR 20 is additivity of ability and item
difficulty, where item difficulty is identified with a function
$F_1: \Lambda \to R$ and ability with a function $F_2: \Phi \to R$. A
necessary and sufficient condition for the KR 21 is additivity of
ability and item difficulty, together with constancy of item difficulty.
This means that the KR 20 formula applies if
$P(\lambda, \beta) = EI_\lambda + EX_\beta/N - EX/N$, for all
$(\lambda, \beta) \in \Lambda \times \Phi$, that is, if the probability that
a given individual obtains a given item correct, or the expected value
of the item score, is the sum of "item difficulty" and a fraction of the
total test score ("ability"), less a constant which depends on the mean
total test score. Further, the KR 21 formula applies if
$P(\lambda, \beta) = EX_\beta/N$, for all
$(\lambda, \beta) \in \Lambda \times \Phi$.

As an alternative statement, a sufficient condition for the KR 20 
is that each individual’s item scores are independent indicators 
(“Poisson trials,” or generalized Bernoulli trials), provided the ad- 
ditivity condition is satisfied, while a sufficient condition for the KR 
21 is that each individual’s item scores are independent, identically 
distributed indicators (Bernoulli trials). If item scores are not 
independent, the additivity conditions are neither necessary nor 
sufficient. Conditions were also found for the more general case 
in which there are nonzero covariances between an individual's item 
scores. 
A practical problem resulting when one applies a nonmetric scaling
algorithm (Shepard, 1962; Kruskal, 1964b; McGee, 1966; Guttman, 1968)
to a matrix of similarity measures is the indeterminacy of the
resulting coordinates. It is always possible to apply a translation,
central dilation, and rotation to such configurations without
affecting the degree of goodness of fit of the solution to the original
similarity measures. Consequently, it is usually desirable to apply
such transformations to the initial output of nonmetric scaling
programs in order to obtain a more readily interpretable solution. The
purpose of this article is to present a few illustrative examples of
how one might fit nonmetric scaling solutions to hypothesized targets
in order to facilitate interpretation. It should perhaps be pointed out
that the major difference between the present program and those
which fit factor analytic solutions to hypothesized targets is the
additional freedom to choose the translation vector (i.e., the location
of the origin) with scaling solutions.
¹ Part of the computer time for this project was supported by National
Aeronautics and Space Administration Grant NsG-398 to the Computer
Science Center of the University of Maryland and part was made
available through the facilities of the Computer Science Centers of the
Ohio State University and the University of New Mexico. This research
was also supported
The computer program for carrying out the computations of fitting
the scaling solution to the target is based upon the mathematical
least squares derivation given by Schönemann and Carroll (1970).
A copy of the program is available upon request from the senior
author. The program determines the transformation matrix used to
rotate the configuration, the dilation factor, and the translation
vector. The scaling solution in its new spatial orientation which best
fits the target in a least squares sense is given, followed by the
residual matrix, the difference between the target matrix and the fitted
scaling solution. In addition, two measures of goodness of fit of the
fitted scaling solution to the target matrix are given. One is the mean
squared element of the residual matrix and the other is the normalized
symmetric error measure defined by Schönemann and Carroll (1970).
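The kind of fit described above can be illustrated with a small planar example. This is a sketch under stated assumptions and not the authors' program: it handles only two dimensions, allows proper rotations only (no reflections), and uses the closed-form least squares solution for a rotation angle, a dilation factor, and a translation vector. The function name `fit_to_target` and the example points are hypothetical. It reports the mean squared element of the residual matrix; the normalized symmetric error measure is not reproduced here.

```python
import math

def fit_to_target(A, B):
    """Fit planar configuration A to target B by rotation, dilation, translation.

    Minimizes sum_i ||B[i] - (c * R(theta) * A[i] + g)||^2 and returns
    (c, theta, g, fitted, mse), where mse is the mean squared residual element.
    """
    n = len(A)
    ax, ay = sum(p[0] for p in A) / n, sum(p[1] for p in A) / n
    bx, by = sum(p[0] for p in B) / n, sum(p[1] for p in B) / n
    # Centering both configurations removes the translation from the problem.
    Ac = [(x - ax, y - ay) for x, y in A]
    Bc = [(x - bx, y - by) for x, y in B]
    # Sums of dot products and cross products determine the best angle.
    C = sum(a[0] * b[0] + a[1] * b[1] for a, b in zip(Ac, Bc))
    S = sum(a[0] * b[1] - a[1] * b[0] for a, b in zip(Ac, Bc))
    theta = math.atan2(S, C)
    c = math.hypot(C, S) / sum(x * x + y * y for x, y in Ac)
    cos_t, sin_t = math.cos(theta), math.sin(theta)

    def transform(x, y, gx=0.0, gy=0.0):
        return (c * (cos_t * x - sin_t * y) + gx, c * (sin_t * x + cos_t * y) + gy)

    # Translation maps the transformed centroid of A onto the centroid of B.
    mx, my = transform(ax, ay)
    g = (bx - mx, by - my)
    fitted = [transform(x, y, g[0], g[1]) for x, y in A]
    residual = [(b[0] - f[0], b[1] - f[1]) for b, f in zip(B, fitted)]
    mse = sum(dx * dx + dy * dy for dx, dy in residual) / (2 * n)
    return c, theta, g, fitted, mse

# Hypothetical check: the target is an exactly transformed copy of A.
A = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
c0, t0, g0 = 2.0, math.pi / 6, (3.0, -1.0)
B = [(c0 * (math.cos(t0) * x - math.sin(t0) * y) + g0[0],
      c0 * (math.sin(t0) * x + math.cos(t0) * y) + g0[1]) for x, y in A]
c, theta, g, fitted, mse = fit_to_target(A, B)
print(c, theta, g, mse)
```

In higher dimensions the rotation is usually obtained from a singular value decomposition rather than a single angle; the planar case is used here only to keep the sketch free of external libraries.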

Proponents of scaling techniques frequently express the opinion 
that the utility of the techniques is much greater when testing pre- 
existing hypotheses about the underlying structure of the stimuli 
than when using the techniques for exploratory expeditions. The 
arguments are familiar ones to those familiar with factor analysis. 
For instance, Thurstone (1947) has said one should assemble tests 
for a factor analysis on the basis of some a priori hypothesis while 
Guttman (1954) has always emphasized that factor analysis should 
be used more for testing a priori theories than for developing a pos- 
teriori theories. The following examples illustrate how one could test 
some a priori theories an experimenter might hypothesize about the 
underlying structure of a set of objects to be scaled. 

A correlation matrix of 12 variates was generated from a theoret- 
ical factor structure having five factors with four distinct clusters 
(Carroll, 1969). One of the factors was a general factor and was a 
linear combination of the other four factors. This correlation matrix 
was scaled using a modification of Kruskal’s program which was 
written by Robert J. Wherry. The resulting coordinates are given in 
Table 1. Identification of the dimensions would be difficult with the 
solution in its present form. However if an experimenter had reason 
to believe either by inspection of the correlation matrix or by some 
other means that there were four distinct clusters, then he could 
create a target matrix of ones and zeroes where each column could 
represent a cluster with loadings of “one” for the variables in the 
cluster and loadings of “zero” for the variables not in the cluster. 
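A target of this kind is simple to write down. The snippet below is only an illustration: the helper name `cluster_target` is hypothetical, and the grouping of the 12 variates into four clusters of three is an assumed hypothesis, not a reproduction of any table in the article.

```python
def cluster_target(memberships, n_clusters):
    """Return a target matrix with one row per variable and one column per cluster.

    memberships[v] is the (zero-based) cluster hypothesized for variable v;
    the variable gets a loading of one on that column and zero elsewhere.
    """
    return [[1.0 if k == m else 0.0 for k in range(n_clusters)]
            for m in memberships]

# Assumed hypothesis: variates 1-3, 4-6, 7-9, and 10-12 form the four clusters.
memberships = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
target = cluster_target(memberships, 4)
for row in target:
    print(row)
```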
TABLE 1
Coordinates from Common Factor Space Scale Analysis of Correlation Matrix

                      Dimension
Variate        1           2           3
   1       -0.5190      1.4800     -0.8200
   2       -0.4180      1.3160     -0.7230
   3       -0.3790      1.1690     -0.6300
   4        1.8630     -0.0520     -1.4770
   5        1.7350     -0.0520     -1.3250
   6        1.6550     -0.0120     -1.1920
   7       -0.5750     -1.4290     -0.7920
   8       -0.4520     -1.8200     -0.7250
   9       -0.3890     -1.2350     -0.6360
  10        0.9300     -0.0660      1.2890
  11        0.8720     -0.0550      1.1440
  12        0.7980     -0.0340      1.0570


Table 2 gives the target matrix which correctly identifies the clusters
and also gives the fitted scaling solution when the coordinates in
Table 1 were fitted to the target of Table 2. Since the program
requires both the scaling solution and target matrix to have the same
order, the original three dimensional scaling solution was augmented
by a column of zeroes. It can be seen that the variables clustered as
they should and that the fitted scaling solution would have greatly
facilitated any attempts at interpretation.

It should be pointed out that the placement of variables into
clusters need not always be perfect since the program provides a least
squares fit. For instance, the coordinates of Table 1 were fitted to the
target matrix given in Table 3. The resulting fitted scaling solution
is also shown in Table 3. Once again the variables clustered as they
should even though the hypothesized clusters had four of the 12
variables incorrectly placed. The hypothesized structure of Table 3
would have been rejected in favor of the hypothesized structure of
Table 2. The fitted solution in Table 3 does not have nice "clean"
factors in the sense that the variables not within the main cluster
have small but not negligible coordinates. We attempted to pull out
the general factor and obtain a hierarchical solution (Schmid and
Leiman, 1957) by placing one variable from each cluster onto an
additional fifth factor. The target matrix and fitted solution are
shown in Table 4. The hierarchical structure of the variables was
uncovered with a general factor and four "clean" factors. In some
instances an investigator might only be able to identify a single
variable for each of the different clusters. In such cases fitting the
scaling solution to a target where only the marker variables have
loadings of "one" might still reveal the underlying structure of the
entire set of variables. Table 5, for example, shows the fitted scaling
solution when the coordinates of Table 1 were fitted to the target
matrix of Table 5. The correct clustering of the variables is clearly
revealed and interpretation of dimensions would be greatly enhanced.


TABLE 6
Initial Coordinates from Scaling Analysis of Similarity Measures
between 12 Vowels as Given by Singh and Woods

                     Dimension
Vowel         1           2           3
  i       -0.9550     -0.7660     -0.2130
  ɪ       -0.3560     -0.4120      0.0670
  e       -0.6340     -0.1850     -0.1610
  ɛ       -0.3080     -0.5170     -0.4440
  æ        0.0380     -0.3190     -1.0120
  ɑ        0.5930      0.0590     -0.8520
  ɔ        0.6510      0.4830     -0.1670
  o        0.5720      1.2180      0.0500
  ʊ       -0.0460      0.2910      0.9450
  u       -0.3260      0.8980      0.7170
  ʌ        0.4710      0.2540      0.3660
  ɝ        0.2960     -1.0050      0.7050

The final illustration describes an actual application of the
program to test two opposing experimental theories.² Kruskal's
nonmetric scaling program was used to analyze similarity judgments
between 12 American English vowels. The obtained coordinates are
shown in Table 6. The investigators proposed two possible models
to explain the similarity measures. The first model consisted of the
three dimensions of tongue height, tongue advancement, and tenseness
while the second model substituted retroflexion for the tenseness
dimension. Since tongue height and tongue advancement are considered
ternary dimensions, target coordinates of 1.0, 0.5, and 0.0
were used for these dimensions while only target coordinates of 1.0
and 0.0 were used for the tenseness and retroflexion dimensions. The
target matrix and best fitting scaling solution for the model with
tenseness as a dimension are shown in Table 7 while the target matrix
and best fitting scaling solution for the model with retroflexion
as a dimension are shown in Table 8. It is readily apparent that the
data provided a poor fit to the model with tenseness as a dimension
but resulted in a good fit to the model with retroflexion as a
dimension. The normalized symmetric error measures were .0891 for the
model with retroflexion as a dimension as opposed to a large .1290 for
the model with tenseness as a dimension.


TABLE 7
Initial Solution Fitted to Model with Tenseness as a Dimension

                Target Matrix                          Fitted Solution
         Tongue    Tongue                       Tongue    Tongue
Vowel    Height    Advancement   Tenseness      Height    Advancement   Tenseness
  i      0.        0.            1.0000          0.2641   -0.1383        0.6134
  ɪ      0.        0.            0.              0.293     0.2238        0.4520
  e      0.5000    0.            1.0000          0.4064    0.1828        0.6419
  ɛ      0.5000    0.            0.              0.5139    0.0668        0.4010
  æ      1.0000    0.            0.              0.8681    0.0898        0.3063
  ɑ      1.0000    0.5000        0.              0.9500    0.4197        0.1722
  ɔ      1.0000    1.0000        0.              0.7289    0.7715        0.2709
  o      0.5000    1.0000        1.0000          0.7702    1.0908        0.5209
  ʊ      0.        1.0000        0.              0.0730    0.7973        0.5221
  u      0.        1.0000        1.0000          0.2675    0.9032        0.8238
  ʌ      0.5000    0.5000        0.              0.4068    0.7704        0.2833
  ɝ      0.5000    0.5000        1.0000         -0.0394    0.3223        — — 0.008:


TABLE 8
Initial Solution Fitted to Model with Retroflexion as a Dimension

                Target Matrix                          Fitted Solution
         Tongue    Tongue                       Tongue    Tongue
Vowel    Height    Advancement   Retroflexion   Height    Advancement   Retroflexion
  i      0.        0.            0.              0.1538   -0.1887        0.0656
  ɪ      0.        0.            0.              0.2871    0.2081        0.1716
  e      0.5000    0.            0.              0.2619    0.1547       -0.0660
  ɛ      0.5000    0.            0.              0.5123    0.0312        0.0767
  æ      1.0000    0.            0.              0.8816    0.0466       -0.0751
  ɑ      1.0000    0.5000        0.              1.0446    0.4058       -0.0267
  ɔ      1.0000    1.0000        0.              0.7915    0.7933        0.0191
  o      0.5000    1.0000        0.              0.6670    1.1324       -0.2383
  ʊ      0.        1.0000        0.              0.0591    0.8372        0.2312
  u      0.        1.0000        0.              0.0305    0.9378       -0.1618
  ʌ      0.5000    0.5000        0.              0.5051    0.8028        0.2197
  ɝ      0.5000    0.5000        1.0000          0.3053    0.3387        0.7840


² The authors would like to thank Sadanand Singh and David Woods for
permission to reproduce part of their findings here.
FINDING POINTS OF VIEW IN JUDGMENT DATA


ROGER PENNELL¹
Educational Testing Service


It certainly must be argued that the availability of computers to
experimenters in the behavioral sciences provides the capability for
much finer and much more thorough data analyses. With the myriad
of multivariate procedures which are more or less routinely
implemented on our computers an investigator finds himself confronted
with a large number of tacks he might take to evaluate his
experimental hypothesis. Often, however, the investigator shortchanges
himself by utilizing the most exotic of procedures. The case in point
is the model by Tucker and Messick (1963), henceforth TM, to
analyze a data matrix of p judgments by N subjects into components
accounting for subject variance and components accounting for
judgment variance. Whereas before, one could only wonder about
individual differences that were known to exist in a sample of
subjects, one now had a procedure to isolate the components of these
individual differences. Whereas before, one analyzed the mean
judgment (or every subject's set of judgments separately), the
sample could now be partitioned into groups giving more or less
homogeneous responses.

The thesis propounded in this paper is, first, that investigators 
tend to have misconceptions concerning the model, and second 
(not necessarily as a result of the first), investigators tend to mis- 
use the model. In order to operate on common ground let us digress 
and indicate the exact model. 

The most common utilization of the TM model occurs when an
investigator has obtained n(n - 1)/2 = p judgments on all pairs
of n stimuli for each of N subjects. This generates the p × N data
matrix X assumed to have the following form:

$$X = UGW \tag{1}$$

where U contains the column-wise eigenvectors of XX', W contains
the row-wise eigenvectors of X'X, and G is a diagonal matrix
containing the positive square roots of the eigenvalues of either XX'
or X'X.

¹ The author is indebted to Robert Weber, Cornell University, for
performing the necessary computer programming.

Due to a theorem by Eckart and Young (1936) we know that for
any arbitrary rank r we necessarily produce a least squares
approximation to X by

$$X_r = U_r G_r W_r, \tag{2}$$

where $X_r$ is the least squares, rank-r approximation to X, $U_r$
contains the first r columns of U, $W_r$ the first r rows of W and $G_r$
the first r rows and columns of G. The experimenter usually chooses r by
one or another subjective procedure aimed at finding the minimum
"significant" number of components needed in the model. At this point
Tucker and Messick state that the elements in $U_r$ represent
projections of stimulus pairs on unit length principal vectors of X, the
elements of $W_r$ represent projections of people on the unit length
principal vectors of X and that, further, each column of $U_r$
represents a set of distance measures for the set of p judgments. We can
now, for instance, absorb $G_r$ into $U_r$ and $W_r$ and produce a
transformation on $W_r$, say T, that is more psychologically pleasing
than the principal vector orientation and still preserve the form of the
model as

$$X_r = (U_r G_r^{1/2} T^{-1})(T G_r^{1/2} W_r) = YZ. \tag{3}$$


Perhaps the most interesting notion that TM develop is that of
an idealized individual. Since the columns of Z represent projections
of people on r rotated dimensions, it is clear that we may append
any number of additional columns (representing imagined or
idealized individuals), say m of them, on to the end of Z and after
premultiplying by Y, our matrix $X_r$ will be p by N + m where
the last m columns represent judgments made by idealized individuals.
As such these judgments may be analyzed by one or another
multidimensional scaling routine to obtain the underlying
structure of the stimuli as they appear to the idealized individuals.

We shall proceed in three phases: to show two common misuses
of the model; to use a set of artificial data to show that incorrect
interpretations are a result of these misuses; and to illustrate the
proper approach to analyzing such data.


Misuses of the Model 


TM state that one should expect the first component of U to be
highly correlated with the mean judgment, which brings us to our
first point. Knowing that the first component of U essentially
represents a set of mean judgments, some investigators apply the
TM point-of-view routine with no intention of searching for
individual differences in their data. With some phrase like "the
pattern of eigenroots was inspected and it was decided that one
component was sufficient to . . . ," they could analyze the distances
from only the first component and simultaneously report the utilization
of a fancy multivariate procedure. It is argued that this procedure
is wrong for three reasons: (a) The rationale for selecting only one
component is usually related to the very large size of the first
eigenroot. It was clearly stated in the TM paper that we should
expect the first eigenroot to be large (due to choosing not to eliminate
mean variance by row-centering) and that this state of affairs is
totally independent of whether or not individual differences exist.
(b) Only in the most uninteresting of cases (certainly null) is it
tenable to assert that there exist no consistent, identifiable
characteristics of subjects which produce intersubject variance. (c)
Granted that we have rightly or wrongly decided to eliminate
considerations of individual differences, why use the elements of
an eigenvector to represent distance measures when we can put our
feet on the ground with actual means with known sampling
properties?
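Point (a) can be illustrated with a small artificial matrix. The sketch below is not from the TM paper or from the present article: the data are hypothetical, built so that two groups of subjects deviate from a common mean judgment in exactly opposite ways, and the function name `largest_eigenroot_share` is an illustrative choice. Plain power iteration on X'X is used to find the first eigenroot, and its share of the total sum of squares is reported.

```python
def largest_eigenroot_share(X, iterations=200):
    """Share of the total sum of squares of X carried by the first eigenroot of X'X."""
    p, N = len(X), len(X[0])
    # Non-uniform start so the vector is not orthogonal to the leading direction.
    v = [1.0 + k for k in range(N)]
    for _ in range(iterations):
        # One power-iteration step: v <- X'(X v), then rescale to unit length.
        Xv = [sum(X[j][k] * v[k] for k in range(N)) for j in range(p)]
        w = [sum(X[j][k] * Xv[j] for j in range(p)) for k in range(N)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    Xv = [sum(X[j][k] * v[k] for k in range(N)) for j in range(p)]
    first_root = sum(t * t for t in Xv)             # v'X'Xv for unit-length v
    total = sum(x * x for row in X for x in row)    # trace of X'X
    return first_root / total

# Hypothetical data: 6 judgments by 4 subjects, mean judgment 5 throughout.
# Subjects 1-2 deviate by +d, subjects 3-4 by -d, so the two groups disagree
# systematically on every judgment.
d = [1, -1, 1, -1, 1, -1]
s = [1, 1, -1, -1]
X = [[5 + d[j] * s[k] for k in range(4)] for j in range(6)]

share = largest_eigenroot_share(X)
print(round(share, 4))
```

In this construction the first eigenroot is 600 out of a total of 624, so it accounts for about 96 percent of the sum of squares even though the subjects fall into two sharply opposed groups; a large first eigenroot by itself says little about the absence of individual differences when the mean has not been removed.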

The second area of conceptual difficulty centers around the
notion that the decomposition in (2) provides us with individual
points of view, or individual sets of distance measures which can
each be analyzed to obtain representative stimuli configurations.
No matter whether one considers $U_r$ or Y, the column-wise
elements are not in general all positive and therefore do not even
possess the elementary property of distances: non-negativeness.
Some would argue that a set of distances both positive and
negative simply constitutes an "additive constant" problem; however,
this author has had little interpretive success upon scaling such
numbers based on this premise.

A helpful heuristic in conceptualizing the subject space is to 
consider it made up of a large number of directions. As we move 
along some particular direction some facet of stimulus relation- 
ships changes in a consistent fashion. As an example, subjects 
closer to the origin in a particular direction might perceive stim- 
ulus 7 and j to be closer together than subjects farther from the 
origin in this same direction. Were we to pick a point in the space, 
multiply through its coordinates to get an idealized set of distances 
and find that some of these distances were negative, we should be 
satisfied that we have chosen an idealized subject that we could 
never, even theoretically, observe. This is so because he perceives 
two or more stimuli as being so close together that their distance 
is negative. It seems at best fatuous to analyze distances from a 
subject who is theoretically not observable. Furthermore, taking,
say, the ith column of U_r as a set of distance measures is equiva-
lent to utilizing the one-dimensional centroid (mean) of the corre-
sponding ith subject component from W_r. That is to say, this is
one way of idealizing the ith component of subject variance. But,
indeed, this is the height of absurdity unless there exist subjects
with high scores on the ith component of W_r and essentially zero
scores on all other components. If this is not the case, we are im- 
plicitly embracing a model which says that the way in which sub- 
jects make judgments about stimuli can be viewed as a multidi- 
mensional process, and that we are interested in one dimension of 
that process even though it produces judgments not at all like the
judgments actually made. For this reason the statement made by
TM: “These stimulus-pair projections, when . . . rotated to orienta-
tions possibly more appropriate psychologically than the principal-
axes position, will constitute measures of distance between pairs
of stimuli” (Tucker and Messick, 1963, p. 29), is simply not
worded strongly enough, i.e., we must isolate dimensions, by
means of rotation, which pass through clusters of real subjects, and,
as such, generate an essentially “simple structure” space for sub-
jects.2 Without this we embrace the somewhat bizarre model alluded
to above. We shall delay this point until the example which fol-
lows and acknowledge that Cliff (1968) has cogently argued a
rather similar point.


2 This is not to say that one may not eliminate the rotation problem alto-
gether by choosing interesting points corresponding to idealized individuals.


Example 


As an example we shall consider a fictitious set of data in which 
rather extreme points of view actually exist. We shall generate 
points of view by concocting four ways in which a set of two- 
dimensional stimuli might be “conceptualized” by hypothetical 
subjects. Figure 1a represents a standard conceptualization, 1b
and 1c represent subjects that use either the first dimension or the
second, but not both, and 1d represents a uniform contraction of the
1a space. This example is slightly extreme, but it is not hard to
imagine a population of subjects that differ in their perceptions of
a set of stimuli along the lines of Figure 1. The four sets of inter-
point distances corresponding to the four points of view about the
stimuli were computed, and an additional sample of four subjects
was generated for each of the points of view by adding random
noise distributed as N (0.5) to each “true” interpoint distance. This
generates the matrix X with p = 28 and N = 20 (five subjects for
each point of view). X was decomposed by (1) and (2) taking
r = 4. The elements of G were 1070.79, 455.34, 10.57, 7.12, 6.61,
4.21, 3.28, 2.89, 2.66, 2.49, 2.28, 1.90, 1.78, 1.57, 1.33, 1.17, .87, .77,
.67, .56. If these roots were derived from exploratory data, one
would surely not take more than three components; on the other 
hand, one should not conclude that there is only one point of view 
merely because there is one enormously large root. Presumably 
there appear to be only three points of view because the first and 
last population points of view are so similar. 
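
The construction of X described above can be sketched roughly as follows. This is only an illustration: the stimulus coordinates, the axis weights standing in for the four points of view, and the reading of the noise term as a standard deviation of 0.5 are all assumptions, not the values behind Figure 1.

```python
import itertools
import math
import random

# Hypothetical coordinates for 8 stimuli in 2-space (NOT the Figure 1 values).
STIMULI = [(0, 0), (1, 0), (2, 0), (0, 1), (2, 1), (0, 2), (1, 2), (2, 2)]

# Four hypothetical points of view, each expressed as axis weights (w1, w2).
VIEWS = {
    "1a standard": (1.0, 1.0),
    "1b first dimension only": (1.0, 0.0),
    "1c second dimension only": (0.0, 1.0),
    "1d uniform contraction": (0.5, 0.5),
}


def interpoint_distances(points, w1, w2):
    """Euclidean distances between all pairs after weighting each axis."""
    return [math.hypot(w1 * (a[0] - b[0]), w2 * (a[1] - b[1]))
            for a, b in itertools.combinations(points, 2)]


def build_data_matrix(noise_sd=0.5, extra_per_view=4, seed=0):
    """Return X as a list of N columns, each holding p distance judgments."""
    rng = random.Random(seed)
    columns = []
    for w1, w2 in VIEWS.values():
        true_d = interpoint_distances(STIMULI, w1, w2)
        columns.append(true_d)                  # the error-free subject
        for _ in range(extra_per_view):         # four noisy subjects per view
            columns.append([d + rng.gauss(0.0, noise_sd) for d in true_d])
    return columns


X = build_data_matrix()
p, N = len(X[0]), len(X)   # 28 stimulus pairs, 20 subjects
```

With 8 stimuli there are 28 pairs, and four points of view with five subjects each give the 20 columns. Note that noisy judgments of a zero "true" distance can come out negative in such a sketch.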

What happens if we decide to use the elements of the first eigen-
vector of U_r as measures of the interpoint distances of the eight
points? We can get a feeling for what kind of configuration we are
going to obtain by considering the correlation of this vector with
the four sets of true interpoint distances obtained from Figure 1.
The correlations in order are .9737, .8852, .3825, and .9367; the
multiple correlation between the four sets of distances and the
first vector is .9999. It seems clear that the set of interpoint dis-
tances we are considering scaling (the first eigenvector of U_r) is
exactly a linear combination of the distances we should be con-
cerned with (the true distances) but is imperfectly correlated with
any one of them, i.e., the first eigenvector of U_r is a figment of our
imagination and represents no empirical state of affairs whatso-
ever.


Figure 1. Four hypothetical “conceptualizations” about 8 stimuli in 2-space
(panels: Fig. 1a, Fig. 1b, Fig. 1c, Fig. 1d).


Results such as obtained from our first eigenvector of U_r make
evident the folly of the “normative” approach to [...] in the
behavioral sciences. Indeed, what good is it to “predict and con-
trol” behavior of a normed nonexistent entity? Clearly we can
[...] the “first eigenvector” approach to resolving the data ma[...]

What of the second, third and fourth eigenvectors of U_r, is there
any hope of finding a correspondence with the original set of dis-
tances? Table 1 presents a rectangular correlation matrix where
rows represent the last three eigenvectors of U_r and columns re-
present the four sets of interpoint distances from Figure 1. Here it
looks like the second vector is a bipolar representation of the
second and third viewpoints; however, the other viewpoints are
not evident. In any case we should expect a virtually unconditional
identification since we started from concocted data, and the results
in Table 1 do not afford such identification. In no case can we hope
to recover configurations of stimuli like those in Figure 1, even
though we know them to be present, from the last three vectors of
U_r.


Admissible Sets of Distances 


In our case neither rules of thumb nor orthogonal rotations will 
yield an admissible set of distances—a set which correlates almost 
perfectly with the original set, and which, therefore, affords the 
possibility of recovering the exact configurations of stimuli. We 
have to simply look at the data (W_r) and observe that there are
four clusters of points (subjects) lying on obliquely related axes. 
The problem can be attacked in either of two ways. We can, as 
Cliff (1968) suggests, merely read off the centroids of those four 
clusters, array each centroid as a column in, say, D, and produce 


X_r* = U_r G_r D, (4)


where X_r* represents judgments of distance made by four idealized
individuals. Note that this essentially averaging process (in com-
puting the centroids) is not subject to the same philosophical cri-
ticism as using a mean vector to represent the judgments of all the
subjects. Here we have presumably isolated the components of in-
dividual differences, and, as well, groups of subjects that con-
sistently respond alike. We can, therefore, argue that using cen-
troids is a very natural way to deal with the measurement error
that we expect.


TABLE 1
Correlations between Eigenvectors of U_r and Original Distances, D

          D1        D2        D3        D4
U2     -.2215    -.7288     .7826    -.2793
U3     -.0467    -.0913    -.0961     .2096

A second way to attack the problem, based on our knowledge of 
which subjects belong to which groups, is to produce a pattern 
matrix, P, representing group membership. In this case the matrix 
would be 4 × N and the ith column would contain a 1 in the row
representing the group to which the ith subject belongs and a 0 in
all other rows. The transformation, T, in (3) is the one which
makes Z look as much like P as possible, namely


T = P W_r' (W_r W_r')^{-1} G_r^{-1}, (5)


or


T = P W_r' G_r^{-1}


since W_r W_r' = I. Note that one obtains the distances corresponding
to the groups from the matrix Y.
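
A toy numerical sketch of this step may help. It rests on two assumptions that the present passage does not settle: that (3) has the form Z = T G_r W_r, and that (5) reduces to T = P W_r' G_r^{-1} when the rows of W_r are orthonormal. The matrices below are invented for illustration (r = 2 components, N = 4 subjects, g = 2 groups).

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def pattern_matrix(groups, n_groups):
    """g x N matrix with a 1 where subject i belongs to the row's group."""
    return [[1.0 if g == k else 0.0 for g in groups] for k in range(n_groups)]


# Hypothetical toy values, not taken from the example in the text.
W = [[0.5, 0.5, 0.5, 0.5],
     [0.5, 0.5, -0.5, -0.5]]            # rows orthonormal, so W W' = I
G = [[2.0, 0.0],
     [0.0, 1.0]]                        # diagonal matrix of roots
G_inv = [[1.0 / G[i][j] if i == j else 0.0 for j in range(len(G))]
         for i in range(len(G))]
P = pattern_matrix([0, 0, 1, 1], 2)     # subjects 1-2 in group 1, 3-4 in group 2

T = matmul(matmul(P, transpose(W)), G_inv)   # T = P W' G^{-1}
Z = matmul(matmul(T, G), W)                  # assumed form of (3): Z = T G W
```

In this contrived case the groups lie exactly in the space of the two components, so Z reproduces P; with real data the agreement would only be approximate.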

Using this approach on our artificial data the distances in the 
columns of Y have correlations of .994, .992, 1.00 and .972 with the 
respective original distances. Clearly, if our scaling algorithm is 
sufficiently precise, we can be confident of retrieving the input con- 
figurations. 

The method utilizing the P matrix is possibly the most versatile 
in practice. If the number of groups is large we need not go to the 
trouble to plot subject points and gauge the extent to which they 
cluster, rather we need only gauge the extent of agreement between 
P and Z. The extent to which they agree reflects the extent to 
which we have been able to find a nonrigid rotation of the subject 
axes such that they pass through clusters of actual subjects. Here 
we would be willing to tolerate small negative values in Y as long 
as the fit between Z and P was quite good. 

It should be noted that using this approach we are strictly un-
able to locate a number of groups, say g, which is less than r. This
is the case because what we need is the left-hand inverse of T,
which doesn’t exist when g is less than r. In our example the four
groups fell out rather nicely because they were the four salient com- 
ponents of subject variance and therefore came out as mixtures of 
the first four principal components. The initially appealing idea of 
taking r to be large, perhaps the full set of components, and trying
to find, say, two components representing male judgments and fe-
male judgments is, for the above reason, doomed to fail. If we take
only two components, r = 2, and thus ensure g not less than r, we 
are most unlikely to have these two components represent any mix-
ture of sex variance whatsoever, i.e., it would be extremely un-
likely that sex differences would be prominent enough to come out
as the first two components unless the experimental task was ex- 
plicitly designed to contrast sex differences. 

It should be pointed out that the rather typical problem in these 
types of analyses, especially when the sample of subjects is large, 
is that when trying to plot the subject points in r-dimensional space 
we find one large, irregularly shaped cluster of points. Using the 
rationale developed to this point one clearly proceeds along one of 
two lines: Decide that the individual differences are uninteresting 
or at least unsystematic and therefore compute mean judgments 
and scale those, or take the judgments of those subjects who seem
to span the cluster of subject points and scale each in turn. One 
thereby determines how internalized representations of the stimuli 
vary as the range of individual differences contained in the sample 
is spanned. 


Summary 


We have tried to argue that simplistic and/or heuristic ap- 
proaches to the TM model are often inadequate. In particular, 
there is apparently little to recommend the utilization of the first 
eigenvector as a set of distance judgments.


THE DIMENSIONS IMPLICIT IN PSYCHOLOGICAL
MASCULINITY-FEMININITY1


ROUTH N. COFFMAN 
Saint Elizabeths Hospital 
BERNARD I. LEVY 
The George Washington University 


The theory and method of arriving at the assessment of psy-
chological masculinity-femininity (MF) is not yet clear, in spite
of over half a century of efforts on the part of scholars and re- 
search scientists. Although a number of standardized, self-report, 
MF scales exist (e.g. Terman-Miles Attitude-Interest Analysis 
Blank, The Mf scale of the MMPI, Gough’s Femininity Scale), 
intercorrelations among them, when available, illuminate the 
dilemma of low concurrent validity. 

The need for more reliable groupings of items as an important 
step toward the understanding of the structure of the MF dimen-
sion, or dimensions, has been recognized from the beginning of
MF test construction. Yet in no instance have the single or multiple
dimensions of MF tests been clarified. In writing about their test in
1936, Terman and Miles suggested that perhaps the next step in MF
research should be the devising of highly reliable subtests in as many 
fields of sex differences as is feasible, for the purpose of making 
profile studies of individual subjects possible. 

Little (1949), in what may be an isolated attempt, found that
even the five MF items most highly correlated with the total score
for the entire 60 item Mf scale of the MMPI had correlations of only
.45 to .50, indicating heterogeneity within the scale.

1 This article [...] submitted to the Faculty of the Grad[...]
The George Washington University. The authors wish to thank
Dr. William R. Reevy for his continuing interest and his past help.

In an effort to avoid heterogeneity of content within any sub-
test, Ford and Tyler (1952) grouped all of the items from the
Terman-Miles Attitude-Interest Analysis Blank MF test (Forms 
A and B) to obtain 14 subscores. Then using factor analysis they at- 
tempted to investigate the possibility of multidimensionality of the 
MF variable. They identified two factors for their male subjects 
and three factors for their female subjects. However, within their 
subgroups of items they mixed item types and item content. They 
offered no objective justification for item homogeneity within sub- 
groups. Consequently, the meaning of the factors is clouded by 
the lack of any objective test of consistency within groups of 
items. 

The position adopted in this study was that the underlying 
structure of what has been thought of as the MF variable is multi- 
dimensional, that a more valid description of MF would thus 
reveal a pattern and not a score. An attempt was made to combine 
the approaches of Little (1949) and of Ford and Tyler (1952), by examin-
ing both the intercorrelations between items within groups, and
the intercorrelations between groups of items, with the hope of
coming one step closer to a more precise psychometric definition
of MF.


Method 
Preparation of the Research Form 


The literature, test catalogues, and old files were searched to 
locate existing MF scales.2 The scales selected are published and
standardized, and they are obtainable in test form or are reported 
in the literature. The latest available editions were used. 

2 The scales located were: Behavioral Inventory, Be[...]
(1962 edition), Bernreuter Personality [...] Chil-
dren, California Psychological Inventory (1960 edition), DePauw Adjustment
Inventory, Edwards Personal Preference Schedule (1953 edition), Gray Mas-
culinity-Femininity Scale, Guilford Martin Inventory of Factors GAMIN
(1948 edition), Guilford Zimmerman Temperament Survey (1949 edition),
Gough Femininity Scale (1952 edition), Krout Personal Preference Scale (1951
edition), Minnesota Multiphasic Personality Inventory (1943 edition), Minne-
sota Personality Inventory, Satisfaction Test, Terman-Miles Attitude-Interest
Analysis Test (1936 edition). Those containing the items used in this study
are italicized.

There were 384 items extracted from the selected scales, of
which 81 were repetitions. These latter items were removed. Since
not all items were phrased or scored in the same manner, it
was decided to set some standard for congruity. The use of True
and False, instead of Yes and No, or Like and Dislike, as answer 
choices was decided upon. More items from the original MF 
scales were phrased in the first named manner. Items which were 
in the form of questions were in the minority, and were rephrased 
in the form of statements. The decision to score in the direction 
of masculinity rather than in the direction of femininity was 
also made on the basis of majority rule, and required slight 
changes in rephrasing. Several judges checked all rephrasing, with 
complete agreement that the meaning of the items did not appear 
to be changed. 

The items from the Edwards Preference Schedule presented a 
particular problem. They were designed to estimate heterosexual 
interest and thus can lend weight to a score for either masculinity 
or femininity, depending on the sex of the subject. They were 
included for two reasons: the shortage of other items expressing 
direct sexual interest, and the expectation that they would cor- 
relate more highly with masculinity than with femininity. How- 
ever, as Edwards (1953) had pointed out, items removed from 
the context of his test may be changed in meaning.

Three judges independently assigned items to categories based 
upon item content. Attempts were made in all cases to allow the 
items to cohere on the basis of item content describing the catego- 
ries. Temporary word labels were thus evolved from the categoriza- 
tion process instead of being created to direct it. The first three 
columns of Table 1 contain the list of categories and the number
of items assigned to them.


Supplementary Data 

With the hope of better understanding the relationship between 
the social desirability variable and MF, it was decided to include
the 39 item Edwards Social Desirability Scale (1957). In an 
effort to begin to test the assertion that there may be classes of 
MF items that relate to variables independent of sex differences, 
the Thorndike Vocabulary Test (1942) was included. 


Subjects 


The subjects were 100 men and 100 women associated with an 
urban university. In order to facilitate the generality of scales,
a wide range for the demographic variables was sought within the
restrictions of the university setting. Ages of the 200 subjects 
ranged from 17 to 88 years, with M = 27.31 years for men, 
and M = 26.11 years for women. Education ranged from 12 to 
23 years, with M = 15.06 for men and M = 14.29 for women. 
Thus, on the average the male subjects were approximately a 
year older and had approximately a year more education than 
the women subjects. Thorndike Vocabulary Test IQs varied from 
the Dull Normal range of intelligence through the Very Superior 
range, being skewed in the lower direction for both male and 
female samples. 


Data Analysis 


The model for data analysis was item analysis, and the statistic 
was the biserial correlation. The goal of data analysis was the 
identification of those items which correlate best with the total 
score for each proposed scale. Although this could have been done 
entirely on an iterative basis in a very large scale study, the 
classification of items into hypothetical categories facilitated the 
process in this less than ideal course of action. 

Total splits for each item were determined, i.e., the number of 
subjects answering the items in the direction of scored masculinity 
as opposed to the number answering in the opposite direction. 
Splits were also determined separately for male and for female 
subjects. 

For each of the 12 categories, the total masculinity score for the 
category was obtained with one item omitted. The category score 
was then correlated with the score for the item which had been 
removed. Keeping in mind the possibility that there might be 
items which would distinguish between male and/or between 
female subjects, and not across male and female subjects, this 
was not only done for the entire population but also for each sex 
separately. 
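
The item-omitted procedure just described can be sketched as follows. This is an illustrative reading rather than the authors' computing routine: the function names and the tiny response matrix are hypothetical, and the biserial coefficient is computed from the usual textbook formula, (M1 - M0) / s times pq / y, where y is the normal ordinate at the split.

```python
from statistics import NormalDist, mean, pstdev


def biserial(item, total):
    """Biserial correlation between a 0/1 item and a continuous score."""
    ones = [t for i, t in zip(item, total) if i == 1]
    zeros = [t for i, t in zip(item, total) if i == 0]
    sd = pstdev(total)
    if not ones or not zeros or sd == 0:
        return float("nan")            # undefined without variation
    p = len(ones) / len(item)
    q = 1.0 - p
    y = NormalDist().pdf(NormalDist().inv_cdf(p))   # normal ordinate at the split
    return (mean(ones) - mean(zeros)) / sd * (p * q / y)


def item_rest_correlations(responses):
    """responses[s][j] is 1 if subject s answered item j in the keyed direction."""
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]   # category score, item j omitted
        result.append(biserial(item, rest))
    return result


# Hypothetical responses: 6 subjects by 3 items in one category.
responses = [[1, 1, 1], [1, 1, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0]]
r = item_rest_correlations(responses)
```

Running the same function on the male rows and the female rows separately would give the within-sex correlations mentioned above.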

To be retained in a category, each item had to discriminate by 
at least a 180-20 split for total population, and at least a 90-10 
split for male and for female subjects separately. The correlation 
of the item score with total category score had to reach at least
.20 plus or minus, with all correlations in the same direction.
Biserial correlations of .20 were selected as the cut-off point
because there tended to be a gap between correlations of .20 and
lower correlations. With a population of 200, and a split of
180-20, a biserial correlation of .20 will be found to be statistically 
significant at the .10 level. 
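
One reading of these retention rules is sketched below. The function names are hypothetical, and two interpretive assumptions are made: a "180-20 split" is taken to mean that the smaller side of the split contains at least 20 subjects, and the .20 cut-off is applied to the total-sample correlation while the within-sex correlations need only agree in sign.

```python
def split_ok(n_keyed, n_total, min_minority):
    """True when the smaller side of the split has at least min_minority subjects."""
    return min(n_keyed, n_total - n_keyed) >= min_minority


def retain_item(keyed_men, keyed_women, r_total, r_men, r_women,
                n_men=100, n_women=100, r_min=0.20):
    """Apply the split and correlation criteria to one item (assumed reading)."""
    rs = (r_total, r_men, r_women)
    same_sign = all(x > 0 for x in rs) or all(x < 0 for x in rs)
    return (split_ok(keyed_men + keyed_women, n_men + n_women, 20)
            and split_ok(keyed_men, n_men, 10)
            and split_ok(keyed_women, n_women, 10)
            and same_sign
            and abs(r_total) >= r_min)


# Hypothetical items: counts answering in the keyed direction, then correlations.
kept = retain_item(60, 40, 0.35, 0.30, 0.25)
dropped_for_split = retain_item(95, 40, 0.35, 0.30, 0.25)   # 95-5 among men
```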

The second iteration involved only those items which had been 
retained by the first statistical attempt to maximize the discrim- 
inative value of the categories. Again items were to be retained only 
if their scores met the minimum correlation of .20 with the 
total category score. 


Results 


Because of failure to meet the criteria established to define 
minimum standards for internal consistency, 121 (40 per cent)
of the 303 MF items in the research form were put aside. Of these 
121 items, 30 items were lost because of failure to meet the criteria 
for splits. 

Table 1 presents the number of items lost from each category
during item analyses, and the number of items remaining in
each of the categories after each iteration. With the drop out of
items as a result of the first iteration, Category 11 was reduced to
only 2 items and therefore eliminated.


TABLE 1
Loss of Items from Hypothetical Content Categories During Item Analyses

                                                   Items Failing    Items Failing
Category                                 Number    First Analysis   Second Analysis   Items
Number   Label                           of Items  Splits   r_bis       r_bis         Remaining
  1      Adjustment                         23        2      10           0              11
  2      Reaction to Aversive Stimuli       29        1       0           0              28
  3      Emotional Responsiveness           17        0       4           1              12
  4      Interests-General                  37        1       1           1              34
  5      Interests-Sexual                   19        6       6           0               7
  6      Interests-Vocational               23        0       1           0              22
  7      Personal Appearance and Grooming   21        1       5           0              15
  8      Sociophilia-Sociophobia            28        3      14           0              11
  9      Symptoms                           28        7       6           1              14
 10      Timidity-Boastfulness              22        0       4           0              18
 11      Values-Cultural Directives         30        8      20           0               2
 12      Values-Personal Directives         26        1      16           1               8
         Totals                            303       30      87           4             182

Table 2 presents the correlations among categories and demo- 
graphic variables. Since the MF literature has repeatedly called 
attention to the close relationship of the hypothesized MF variable 
and the kind of esthetic and cultural interests which are usually 
linked directly to education, and more directly to IQ, it was an
unexpected finding that education and also IQ reached correla- 
tions of above .20 with one category only. The correlations with 
the SD variable and with biological sex were another story. Four 
categories reached correlations above .50 with the scores for the 
Edwards Social Desirability Scale items, and six categories cor- 
related above .50 with biological sex. 

Table 3 presents the intercorrelations among categories. In an 
effort to further study the intercorrelations, to consolidate and 
enlarge the categories, and thereby increase their reliability, in- 
tercorrelations among total scores for each category were sum- 
marized in a factor analysis which is also reported in Table 3. 

In sum, this study ended with 182 of the original 303 items 
from the nine MF scales falling into six relatively homogeneous 
subscales which tentatively define the postulated heterogeneity of 
the item content. 


TABLE 2
Correlations of Categories with Demographic Variables
(N = 200)

Categories                              Age   Education    IQ     SD    Sex
 1. Adjustment                          11b      07       -03     79     16
 2. Reaction to Aversive Stimuli       -19       20        02     31     65
 3. Emotional Responsiveness            10       09       -08     68     47
 4. Interests-General                  -07       11       -18     20    100
 5. Interests-Sexual                   -20        ?        08    -12     56
 6. Interests-Vocational                00       10         ?     22     96
 7. Personal Appearance and Grooming    01       24         ?      ?     52
 8. Sociophilia-Sociophobia            -09       01         ?      ?     17
 9. Symptoms                            11       17       -01      ?     31
10. Timidity-Boastfulness                ?       16       -09     55     46
12. Values-Personal Directives         -06       10       -14     08     65

a Biserial correlations.
b Decimals omitted.


Discussion

Throughout much of the literature on personality measurement 
it appears to be generally accepted that one of the most popular 
methods of test construction is one which relies upon some method 
of internal consistency (Ferguson, 1952). In spite of this, at- 
tempts to get at the meaning of the MF personality variable by 
an examination of the internal consistency of the scales used to 
measure this variable have not been rigorous. In testing hypoth- 
eses concerning the probable heterogeneity of the MF scale con- 
tent, and the probable multidimensionality of the MF variable 
measured by this content, research scientists have usually re- 
stricted themselves to studies of the relationships between scales, 
or between subtests of scales. 

The fact that the confirmation of the particular dimensions lo- 
cated here will require further studies does not vitiate the rigor 
with which the original proposition of multidimensionality has 
been demonstrated. MF, as measured by the items used in this 
study, appears to consist of at least six reasonably independent 
dimensions. 

Factor I (Fastidiousness) delineates a scale which combines 
two categories that are clearly correlated with biological sex 
and unrelated to age, education, SD, and IQ. The items in Category
2, which carried the tentative label, “Reaction to Aversive Stim- 
uli,” are correlated with sex and emphasize queasiness and dread. 
The items in Category 7, “Personal Appearance and Grooming,” 
emphasize an aversion to lack of fastidiousness, and an inter- 
est in personal adornment. They were correlated with biological 
sex. Clearly, men and women do differ in their expressed tolerance 
for certain fears and revulsions, with men less fussy about ex- 
ternal appearances and more counter-phobic, and women more 
self-consciously critical and more phobic. Thus we have items 
which men and women do tend to answer in sex-appropriate direc- 
tions; items which could be used to “locate the subject, with a fair 
degree of approximation, in terms of deviation from the mean of 
either sex,” in the manner in which Terman (1936) had originally 
set as the goal for measures of MF. 

Factor II (Attitudes toward Affective Life), which accounts for 
the largest amount of shared variance, consolidates into one scale 
four categories which are all clearly correlated with the social de-
sirability variable. Two of these categories are also correlated, to a
lesser degree, with biological sex. There appears to be no significant
discrimination between men and women by items which have to do
with attitudes and worries about emotional adjustment and symp-
toms of psychopathology since Category 1, “Adjustment,” had the
lowest correlation of all categories with sex, and Category 9, “Symp-
toms,” had a very modest correlation with sex of subject. However,
with Category 3, “Emotional Responsiveness,” correlating with sex 
of subject, and Category 10, “Timidity-Boastfulness,” correlating 
with sex of subject, men and women did more definitely distinguish 
themselves from one another on items conveying a more direct ex- 
pression of anxiety and emotionality. 

Factor III (Need for Social Closeness) has a high loading from 
Category 8 which was tentatively labeled, “Sociophilia-Socio- 
phobia,” and from no other category. This creates a scale in 
which the content emphasizes affiliative needs. This scale is to a 
modest degree related to the SD variable and not signifi- 
cantly related to the other external variables, including sex. A 
study of the item content, and the very low intercorrelations of 
Category 8 with all of the other categories, reveals that the 
items in this scale are tapping what reads like a reasonably 
separate dimension, with more detached and impersonal subjects 
at one pole and subjects who are both more nurturing and more 
dependent, as well as more eager to please, at the opposite pole. 

Factor IV (Interest in Vocations and Avocations) establishes a 
scale by combining Category 4, “Interests—General,” and Cate- 
gory 6, “Interests—Vocational.” This scale replicates the old 
stand-by of most research findings. Preferences and choices for vo-
cations and avocations, for careers and hobbies, showed almost
perfect correlation with biological sex, and no significant cor- 
relation with any of the other variables. These sex-role prefer- 
ences appear to be so firmly rooted in cultural traditions as to ap- 
proach taboos. However, the pattern of splits and correlations 
for the items in these categories shows that in at least half 
of the cases they discriminated successfully for total population 
mainly because the splits or correlations differentiated for only 
one sex. Thus, since nearly perfect discrimination between the 
sexes would be of little psychological value, we have within this 
scale at this point in time more meaningful discrimination among 
subjects of the same sex than would at first seem evident.

Factor V (Punitiveness versus Mercy) has a high loading in
Category 12, which was tentatively labeled “Values—Personal
Directives.” This scale located by Factor V is comprised of only 
eight items, and is therefore not very reliable. However, one item 
reached a biserial correlation of .76 and five of the other 
items have correlations above .50. With the content so clearly 
related to a callousness versus tenderness dimension, additional 
items could now readily be added to improve the reliability. The 
items forming this scale correlate more negligibly with the SD 
variable than the items from any other category, while their 
high correlation with biological sex is surpassed only by the in- 
terest items. It is a factor which shares with Factor II the 
ability to single out responses which are specifically sex-appropri- 
ate from responses which are socially appropriate in general. These 
responses differentiate between the sexes by showing men as more 
punitive and vengeful, and women as more merciful but with 
accompanying subjective distress. 

Factor VI (Overt Attitudes toward Heterosexuality) has a 
high loading in Category 5. Since this category was tentatively 
labeled "Interests-Sexual" it seems rational to have expected it 
to correlate highly with biological sex and not with the other 
variables. This expectation was borne out in part in that the 
only significant correlation of this category with the other varia- 
bles is with sex of the subject. 

Thus, the six subscales located in this attempt to arrive at a 
psychometric definition of the MF personality variable point to 
the inadequacy of assessing MF by a single score. They demon- 
strate the multidimensionality of MF for this study, and clarify 
why attempts to correlate total scores for different MF scales have 
so frequently led to disappointing and inconsistent results. To the 
extent that MF scales have differing representations of these six 
subscales (and others not yet defined) high correlations would 
be very unlikely, if in truth the subscales are as independent 
as they appear to be. Although corroborative evidence for the 
particular dimensions of MF located in this study is needed, the 
original proposition of the multidimensionality of MF has been 
demonstrated.


A Q VALIDATION OF THE STRUCTURE
OF SOCIAL ATTITUDES1


FRED N. KERLINGER 
New York University 


The purpose of this study was to test certain implications of a
structural theory of attitudes using a Q methodological approach. 
The theory has been tested with some success cross-sectionally 
(Kerlinger, 1967a, 1967b, 1970), but its notions also lend them- 
selves to testing in a Q manner. 

The theory and its implications have been discussed elsewhere 
(Kerlinger, 1967b). Consequently only an outline of it is needed 
here. Social attitudes are a subset of the domain of attitudes 
whose referents have shared general relevance to many people in 
economic, educational, religious, ethnic, and other social areas. 
Two basic dimensions or factors, liberalism and conservatism, 
that are relatively orthogonal to each other, and that are defined 
by liberal and conservative attitude referents, underlie social
attitudes. A referent is a concept or a category, a set of things
toward which an attitude is directed: private property, children’s 
needs, civil rights, religion, for example. A criterial referent of an 
attitude is a construct that is significant and relevant to the 
individual; it is the focus of his attitude. The universe of social 
attitudes, like the universe of social attitude referents, has two 
subsets, one liberal and one conservative. To the liberal, referents 
such as civil rights and social reform are criterial, while referents 
like private property and free enterprise are criterial to the
conservative. The referents criterial to the liberal are in general
not criterial to the conservative, and vice versa. Exceptions to
these statements are radicals of the left and the right: to the
John Bircher, for instance, liberal referents are negatively
criterial.

1 [...]sity. This support is gratefully acknowledged. I am also grateful to Mrs. Asha
Paranjpe who administered many of the Q sorts and who helped substantially
with the analysis of the data.

The operational implications of the theory in Q methodology are
as follows. One, if “known” liberals and conservatives are asked
to sort a set of statements or a set of referents representative of
the domains of attitude statements or attitude referents, factor
analysis of the intercorrelations of the Q sorts should yield two or
more factors with liberals and conservatives loaded on different
factors and relatively little bipolarity on any factor. Two, if the
Q sorts are structured sorts composed of half liberal and half
conservative items, analysis of variance of each individual’s
Q-sort data should yield significant F ratios, with the differences in
means congruent with the individual predictions.

Three, no matter how we measure social attitudes, the above
predictions should hold. This means that two Q sorts, one con- 
structed with attitude statements and the other with attitude 
referents, should yield basically similar persons factor structures 
and, of course, basically similar analysis of variance results. In 
short, if the attitude theory under test is valid, the same duality 
of structure should be present, no matter what the form of meas- 
urement or the manner of the analysis—other things equal and 
assuming the measurement procedures and the analyses are ap- 
propriate to the problem and the data. 


Method 
The Q Sorts 


Two Q sorts designed to measure social attitudes were used. One 
of these, called Social Attitude Q Sort (SAQ), was constructed 
some years ago for use in a doctoral study (Smith, 1963). It was 
a structured sort with 30 liberal (L) and 30 conservative (C) 
items. The items were also categorized as Political-economic and 
general social, but the present study was not concerned with this 
dimension. Two examples of the items, the first liberal and the 
second conservative, are: 


The Constitution needs to be changed from time to time to
meet the changing circumstances of our society.

If civilization is to survive, there must be a turning back to
religion.


If a known conservative sorts the deck using the criterion of ap- 
proval or agreement, the mean of the 30 C items should be sig- 
nificantly greater than the mean of the 30 L items. 

The second Q sort, called Referents Q Sort (REFQ), had 40 
liberal (L) referents (civil rights, social reform) and 40 conserva- 
tive (C) referents (free enterprise, discipline). Using a criterion 
of positive or negative feeling, a liberal subject’s mean of the L 
items should be significantly higher than his mean of the C items. 
The opposite should be true for a conservative subject. The 80 
items were chosen from some 400 referents collected from a number 
of sources: Systematic treatises on conservatism, e.g., Rossiter 
(1962), and liberalism, texts on educational philosophy, e.g., Bru- 
bacher (1962), newspaper editorials, magazine and journal arti- 
cles, and existing attitude scales (Shaw and Wright, 1967). In 
addition, many of the items were written from knowledge and 
experience. Representativeness of the presumed attitude domain 
(economic, educational, etc.) was the main criterion used in 
selecting referents for the Q sort.

REFQ was a structured Q sort of the factorial kind. The first 
dimension, as already indicated, was Attitude: Half the items 
were liberal and half conservative. The second dimension was 
abstractness-specificity. Again, we are not concerned with this 
dimension in the present study. It was included in the Q sort for 
a theoretical-empirical purpose not directly pertinent to the
study.²


Subjects and Administration of Q Sorts


Thirty-three individuals in New York and California of 
“known” liberal and conservative attitudes—known by the inves- 
tigator or known by others—sorted both decks of cards in quasi- 
normal distributions using the criteria mentioned above. Fifteen 
of the sample were liberals and 18 were conservatives. There were
five business people, seven professors, 10 graduate students of
education, two nurses, seven housewives, one retired Navy captain,
and a retired Army general. The data for statistical purposes were
the numbers assigned to the piles sorted by the subjects (see
Stephenson, 1953). To estimate repeat reliability, eight of the 33
subjects sorted REFQ a second time at varying intervals, from
one month to over a year. (The reliability of SAQ had been
established by Smith (1963).) The repeat coefficients of reliability
ranged from .66 to .91, with an average (via z) of .80. Evidently
REFQ is a stable measurement instrument.

2 When REFQ was constructed not enough was known about attitude
referents, and, consequently, concepts that really belong to the value domain were
included: Loyalty, human progress, character, and piety, for example. In
future theoretical and empirical development there will have to be a fairly clear
distinction between attitude referents and value concepts. Fortunately, the results
obtained from REFQ were not vitiated too much by the inclusion of the
value terms.
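
The phrase "an average (via z)" above refers to averaging correlation coefficients after Fisher's r-to-z transformation rather than averaging the raw r's. The following is a minimal sketch of that calculation. The eight coefficients are hypothetical values chosen to span the reported range of .66 to .91; the individual coefficients are not given in the article, so the result is illustrative only.

```python
import math

def mean_correlation_via_z(rs):
    """Average correlations by transforming each r to Fisher's z,
    taking the arithmetic mean of the z values, and transforming back."""
    zs = [math.atanh(r) for r in rs]      # r -> z
    return math.tanh(sum(zs) / len(zs))   # mean z -> r

# Hypothetical repeat-reliability coefficients (NOT the study's data).
example_rs = [0.66, 0.72, 0.75, 0.78, 0.82, 0.85, 0.88, 0.91]
print(round(mean_correlation_via_z(example_rs), 2))
```

Because the transformation stretches values near 1, the z-based average is pulled slightly toward the higher coefficients compared with a plain arithmetic mean of the r's.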


Analysis 


The data of each individual on both Q sorts were subjected to a 
factorial analysis of variance of the 2-by-2 kind, but we are con- 
cerned only with the liberal and conservative means and the sta- 
tistical significance of their differences and the coefficients of in- 
traclass correlation associated with such differences. Each success- 
ful prediction was labeled a hit. Although not directly pertinent to 
the basic predictions of the study, this procedure was considered 
necessary to support the validity of the choice of subjects and es- 
pecially the validity of REFQ. 

The responses of all the subjects to each Q sort were
intercorrelated and factor analyzed using the principal factors method
(Harman, 1967) and varimax rotations (Kaiser, 1958). The highest r's
in each column of the 33-by-33 correlation matrices were used as
communality estimates. A second method of ascertaining predictive
success was used with the results of the factor analysis. The number
of successful predictions, or hits, was counted as follows. It was
expected that known liberals and conservatives would appear on
separate rotated factors, and that liberals would appear together on
a factor or factors, and similarly for conservatives. It was also
expected that most of the subjects would appear on only two factors,
one liberal and one conservative. A factor was categorized as
liberal or conservative by the individuals of known attitudinal
predilections appearing on it. If a known liberal appeared (a loading
.35 or greater) on a liberal factor, this was counted as a hit, and
similarly for a known conservative appearing on a conservative
factor.
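
The hit-counting rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the author's procedure in code form: the loadings, the group labels, and the labeling of factor 1 as liberal and factor 2 as conservative are all hypothetical, and the function name `count_hits` is invented for the example.

```python
def count_hits(loadings, known_groups, factor_labels, cutoff=0.35):
    """Count subjects who load at or above `cutoff` on at least one
    factor carrying the same label ("L" or "C") as their known group."""
    hits = {"L": 0, "C": 0}
    for row, group in zip(loadings, known_groups):
        if any(label == group and value >= cutoff
               for value, label in zip(row, factor_labels)):
            hits[group] += 1
    return hits

# Hypothetical rotated loadings for four subjects on two factors
# (NOT the study's data).
loadings = [
    [0.72, 0.05],    # known liberal
    [0.30, 0.10],    # known liberal, below the cutoff on the liberal factor
    [0.02, 0.64],    # known conservative
    [-0.11, 0.41],   # known conservative
]
known_groups = ["L", "L", "C", "C"]
factor_labels = ["L", "C"]   # assumed: factor 1 liberal, factor 2 conservative
hits = count_hits(loadings, known_groups, factor_labels)
print(hits)   # {'L': 1, 'C': 2}
```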


An important aspect of the analysis was the agreement or
congruence between the results of the two Q sorts. It was expected that
the factor structures of the two factor analyses would be similar,
especially with fewer factors (see Peterson, 1965). It was assumed
that the underlying attitudinal dimensions of liberalism and
conservatism would outweigh other sources of variance in both Q
sorts and that the results from quite different instruments would
thus be similar. In any case, the rotated factor structures for two,
three, and four factors were compared using the coefficient of
congruence (Harman, 1967, p. 270).
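
For readers unfamiliar with the coefficient of congruence, the sketch below uses its usual definition: the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The article does not print the formula, so this definition and the function names are assumptions of the example, and the loading matrices are hypothetical.

```python
import math

def congruence(a, b):
    """Coefficient of congruence between two factor-loading vectors:
    sum(a_i * b_i) / sqrt(sum(a_i^2) * sum(b_i^2))."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

def congruence_matrix(A, B):
    """Compare every factor (column) of A with every factor of B.
    A and B are lists of rows (persons) by columns (factors)."""
    cols_a = list(zip(*A))
    cols_b = list(zip(*B))
    return [[congruence(ca, cb) for cb in cols_b] for ca in cols_a]

# Hypothetical loadings of four persons on two factors from two instruments
# (NOT the study's data).
first = [[0.70, 0.05], [0.65, -0.10], [0.00, 0.60], [0.10, 0.75]]
second = [[0.75, 0.00], [0.60, 0.05], [0.05, 0.70], [0.00, 0.65]]
for row in congruence_matrix(first, second):
    print([round(v, 2) for v in row])
```

Unlike a product-moment correlation, the coefficient does not subtract the vector means, so it is sensitive to the overall level of the loadings as well as to their pattern.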


Results 


Analysis of Variance 


A predictive success with the analysis of variance results is
indicated by the L mean of a known liberal being significantly
greater than the C mean, and vice versa for a known
conservative, as indicated earlier. The numbers of SAQ and REFQ
liberal and conservative hits are reported in the first data line of
Table 1. Of the 33 comparisons of the SAQ L and C means 22 were
hits: 15 for liberals (out of 15) but only 7 for conservatives (out
of 18). The REFQ hits were more consistent: 14 and 14. Evidently
both SAQ and REFQ can measure liberal attitudes with a high
degree of success, but are not as efficient in measuring
conservative attitudes. The harshness of this latter judgment, however, is
somewhat softened by two points. One, none of the conservative
non-hit individuals had higher L than C means, and two, almost
all the analysis of variance non-hit individuals were hits in the
factor analyses, as we will see. In general, then, the predictions
were successful, but more successful for liberals than for
conservatives.


TABLE 1
Analysis of Variance and Factor Analysis Predictive Success, SAQ and REFQ*

                                  SAQ           REFQ
                                L     C       L     C
Analysis of Variance           15     7      14    14
Factor Analysis, 2 Factors     15    16      14    16
Factor Analysis, 3 Factors     15    18      14    18

* Entries in the table are numbers of predictive hits. Total numbers of L subjects: 15, C subjects:
18. L and C indicate liberal and conservative subjects.


Factor Analysis 


The factor analytic results are more compelling. Two, three, 
and four factors were extracted and rotated from both sets of Q 
data. Recall that the most important information we seek must
bear on the dualism the theory claims to be the basic feature of social
attitude structure: Liberals should be loaded together on a factor
or factors, and conservatives, similarly, should be loaded together, 
but liberals and conservatives should appear on different fac- 
tors. Very few (less than five, say) of the individuals should have 
substantial negative loadings, though low negative loadings will 
of course appear. 

The evidence is clear. The second and third data lines of Table 
1 indicate very high proportions of hits with both liberals and 
conservatives on both Q sorts. Using a criterion of loading .35 or
greater, all but one individual, a presumed liberal, were predicted 
accurately in the three-factor solutions. In the two-factor solu- 
tions, there were only five non-hits out of 66 predictions, two con- 
servatives on SAQ and one liberal and two conservatives on 
REFQ. 

Although the predictive evidence is gratifying, in order to bear
more directly on the structural attitude theory being tested, we
need to know the pattern of the loadings of the individuals. The
factor loadings of liberals and conservatives appeared on
different factors in the two-, three-, and four-factor solutions, and
bipolarity was not a prominent feature of any of the rotated
solutions.

3 Intraclass coefficients of correlation, which indicate the consistency with
which individuals place the cards (the variance between the L and C means
relative to the variance within the categories), were calculated for each
individual. The mean coefficients for liberals, on SAQ and REFQ, respectively,
were .56 and .43, whereas the mean coefficients for conservatives were .08 and
[illegible]. Liberals are apparently more consistent than conservatives in their card
sorting.
Although not central to the purposes of the study, the statistical
information on significant interaction F ratios may be of theoretical interest. Eleven
subjects had such significant F ratios with SAQ, and, with liberals, the
significance was due to the general social category means being higher than the
economic-political means, whereas with conservatives it was the opposite. With
REFQ, 14 subjects had significant interaction F ratios: Liberals were for the
most part high on the liberal-abstract category, whereas conservatives were
high on the conservative-specific category. Three examples each of
liberal-abstract and conservative-specific referents are, respectively: Equality, social
reform, desegregation; property rights, competition, free enterprise.

The two-factor solutions of SAQ and REFQ were highly sim- 
ilar. In fact, the agreements between the two Q sorts for the first 
two factors of the two-, three-, and four-factor rotated solutions 
are all high: over .90. The coefficients of congruence between the 
REFQ and SAQ rotated factor vectors are given in Table 2. Only 
liberal individuals appeared on the first factor in all solutions. 
In the three- and four-factor solutions, conservative individuals 
appeared on the second, third, and fourth factors. In other words, 
the first factor in each case was a liberalism factor, whereas the 
later factors were conservative factors.⁴

There was little bipolarity in any of the solutions. Again
counting negative loadings of .35 or greater as significant, there were, in the
two-factor solutions, one such loading in SAQ and two in REFQ. In
the three-factor solutions, there were no substantial negative
loadings in SAQ, but there were four in REFQ. And in the
four-factor solutions, there was one substantial negative loading in
each of the sets of data. There were of course a number of smaller
negative loadings, and among the original correlations there were
a number of negative r's greater than .30, but not enough to make a
real difference in the final rotated factors.


TABLE 2

Coefficients of Congruence between Factor Vectors of SAQ and REFQ,
Two-, Three-, and Four-Factor Solutions*


2 Factors 3 Factors 4 Factors 
4 = —.05 —.05 95 —.05 —.12 .12 
1 ^ —10  .91  .70 .19 


—.12 .96 —.09 -90 -70 


* Coefficients between the first two factors of each solution are italicized. The matrices are
asymmetric because the coefficients were calculated to compare only the REFQ vectors with the
SAQ vectors.


4 Factor arrays, which are Q sorts calculated from the individuals
loaded substantially on the factors (Stephenson, 1953, pp. 174-179), were
calculated for the SAQ three-factor solution and the REFQ four-factor solution.
The content of the items of these factor arrays corroborated the designations
of the factors as liberal and conservative.
Discussion

The theoretical expectations outlined earlier have been sup- 
ported. The evidence of this study using two different Q sorts and 
two different forms of data analysis reinforces earlier evidence 
obtained from analysis of the items of summated-rating social 
and educational attitude scales (Kerlinger, 1967a, 1967b, 1970; 
Kerlinger and Kaya, 1959). It also reinforces the findings of edu- 
cational attitude Q studies done more than a decade ago (Kerlin- 
ger, 1956, 1958). 

The results obtained with a Q approach, especially with the
referents Q sort, are important for three reasons. The first is that
the theoretical expectations of duality of attitude structure and
comparative lack of bipolarity have now been found using both
R and Q methodologies. Research findings are always strengthened
when yielded by different approaches, methodologies, and
measurement instruments (see Cook and Selltiz, 1964). Two, the use of
both attitude statements and attitude referents has led to similar
findings. One of the telling criticisms of previous work is that the
construction and selection of attitude statements were perhaps
biased by the author’s predilections (Zdep and Marco, 1969).
This criticism loses much of its weight when a wide variety of
referents are used. There is much less chance of selective bias, and
item phrasing as a source of bias seems ruled out. In short, the use
of referents was probably the most important part of this study.
And the high degree of agreement between the SAQ and REFQ
rotated factor solutions adds more evidence to the validity of the
idea that liberalism and conservatism are two relatively
orthogonal factors underlying social attitudes.

Considered alone, the study’s results are modest. The obvious
limitations of Q methodology (Kerlinger, 1964, pp. 592-596), of
course, put severe restrictions on the generality of the findings.
Nevertheless, considered in conjunction with findings obtained
with other methods and large samples, the present results are
impressive. While the structural theory of attitudes under test has
hardly enough supporting evidence to claim its validity without
qualification, it at least has enough supporting evidence to take
it seriously and to challenge assumptions about the presumed
polarity of social attitudes, assumptions that are probably not
valid, or that are at least questionable.

5 [...] in North Carolina, Texas, and New York, was recently completed. The results
are generally similar to those of this study: Duality of attitude structure, with
comparative lack of bipolarity, appeared clearly in second-order factor analyses.
PSYCHOMETRIC PROPERTIES OF THE I AND HE
FORMATS IN PERSONALITY ASSESSMENT


ALAN J. KLOCKARS
University of Washington


The Edwards Personality Inventory (EPI) (1967) is designed
to measure all important aspects of the normal personality. The
instructions direct the subject to describe himself as those
individuals who know him best would describe him. The instructions stress
that it is the subject’s opinion about how others view him that is
desired, rather than whether he thinks the statements accurately
describe him. The statements are worded in third-person singular
to agree with the instructions.

Although the concept of the self as perceived by others is not new
in psychology, the manual for the EPI provides no rationale for
this departure from the usual self-description format. One possible
justification would be that the psychometric properties of the
scales are more favorable than under standard instructions.
Edwards (1969) presents data comparing the “I” and “He” formats
for Booklet IA of the EPI. These data indicate that the two
formats yield comparable means, standard deviations and KR-20's,
correlate to the same extent with the Social Desirability (SD)
scale, and correlate highly with each other. These
correlations range from .74 to .90.

Two aspects of the Edwards study warrant further
investigation. First, the same subjects responded to both the I and the He
formats. A time interval of only two days separated the tests, with
all subjects responding to the He format on the first test day. With
this short time period between tests, the subjects may well have
responded from memory on the second occasion. Second, the scales
in the particular booklet which was chosen do not have a wide
range of correlations with the SD scale. The similarity in
correlations with the SD scale may have missed changes which occur for
correlations outside the range included in Booklet IA.

Fiske (1969) also reports similarities in means, but larger
variances and reliabilities for the He format. The He format also tended
to have higher positive intercorrelations between the scales in the
booklet. Fiske also used Booklet IA but with independent groups.
His study did not include the relationships of the scales with the SD scale.

The present study was designed to compare the psychometric
properties of the I and He formats using independent groups.
Booklet III was chosen because it included scales with a wider range of
correlations with the SD scale.


Method 


The I form of the inventory was constructed by changing the 
instructions of the He form to ask the subject to describe himself. 
The 300 statements were rewritten using the first-person singular 
rather than third person. The SD scale was added at the end of the 
I form without any change in instructions. In the He form the SD
scale was also added at the end, but the instructions were changed 
to self description as in the I form. 

Subjects were 267 students enrolled in various junior and senior- 
level Education courses at the University of Washington. Within 
each class the I and He forms were randomly distributed to the 
students. One hundred and thirty students responded to the He 
form and 137 to the I form. Subjects were instructed not to put 
their names on the answer sheets. 


Results 


Table 1 presents the means, standard deviations, and correla- 
tions with the SD scale on the I and He forms for the 13 scales in 
Booklet III. 

Tests of significance were performed to determine if the means of
the two forms differed significantly (α = .05). Scale A, Self-critical,
and Scale M, Virtuous, showed significant differences (t
= 3.67 and -3.48 respectively). Tests were also performed to
determine if the variances differed significantly. Four of the scales
had heterogeneity of variance. In three cases (Scale A, Self-critical;
Scale E, Becomes Angry; and Scale K, Shy) the He form
showed greater variability, while for Scale H, Understands Himself,
the I form showed greater variability. The variances for the
SD scale also showed more difference than expected by chance. The
subjects who had completed the EPI under the He format had a
larger variance on the SD scale than those with the I format.


TABLE 1
Means, Standard Deviations, and Correlations with SD for EPI Scales
under He and I Formats

                          Standard         Correlation
            Mean          Deviation        with SD
         HE       I       HE      I        HE      I
A.     11.32    8.47*    8.21    5.92     -.67    -.65
B.     10.62    9.17     7.50    6.50     -.26    -.12
C.      8.43    8.23     3.80    3.84     -.33    -.21
D.     12.84   13.03     5.39    4.94     -.16     .08
E.      6.92    7.08     3.42    2.84*    -.37    -.34
F.     15.82   15.68     6.13    5.87     -.03     .07
G.      5.99    5.69     3.37    3.07     -.09    -.19
H.     16.09   15.35     3.01    3.51*     .22     .18
I.     19.81   19.88     4.36    4.22      .45     .18*
J.      9.42   10.15     5.28    5.18     -.14    -.12
K.      5.86    5.08     4.70    3.94*    -.59    -.34*
L.      5.58    5.55     3.30    3.10      .16     .10
M.     12.16   10.77*    4.86    4.91      .23     .12
SD.    29.60   30.42     5.96    4.89*

A. Self-critical                     H. Understands Himself
B. Critical of Others                I. Considerate
C. Active                            J. Dependent
D. Talks about Himself               K. Shy
E. Becomes Angry                     L. Informed about Current Affairs
F. Helps Others                      M. Virtuous
G. Careful about His Possessions     SD. Social Desirability

* Significantly different at α = .05.

The overall difference between the variances was tested by
Wilcoxon’s test for paired observations. Over the 13 scales the test
showed a significant difference (z = 2.27), with the He format
being more variable.

Tests of difference between the correlations with the SD scale
were run for all scales. Two of the scales showed significant
differences. These were Scale I, Considerate (z = 2.44), and Scale K,
Shy (z = 2.68). In both cases the correlation for the He format had
a higher absolute value. Wilcoxon’s test was used to determine if
there was an overall tendency for the correlations to be larger for
one of the formats. A significant value (z = 2.30) was observed
with the He format having larger correlations.
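
The article does not name the test used to compare the correlations from the two independent groups. A common choice is the z test based on Fisher's r-to-z transformation, and the sketch below assumes that test. With the Scale I correlations from Table 1 (.45 for the He form, .18 for the I form) and the reported group sizes (130 and 137), it gives a value close to the reported z of 2.44, which suggests, but does not confirm, that this was the procedure.

```python
import math

def z_difference_independent_r(r1, n1, r2, n2):
    """z statistic for the difference between two correlations from
    independent samples, using Fisher's r-to-z transformation.
    (Assumed procedure; the source does not specify the test.)"""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se

# Scale I (Considerate): He form r = .45 (n = 130), I form r = .18 (n = 137).
z_scale_i = z_difference_independent_r(0.45, 130, 0.18, 137)
print(round(z_scale_i, 2))
```

The correlations in Table 1 are rounded to two decimals, so values recomputed from the table may differ slightly from those reported in the text.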
Discussion 

The results of the present study indicate greater differences be- 
tween the two instructional sets and item formats than those re- 
ported by Edwards (1969). In general, the findings support the 
previous work of Fiske (1969). 

The most pervasive difference found is between the variances of
the scores under the two formats. The He format has a larger
variance in three of the four significant F ratios and over the set of
scales is significantly larger. This finding replicates Fiske's. An
unexplained finding is that subjects who had responded to the He
format in the inventory continued to show greater variability in their
responses to the SD scale, even though this scale was taken under
the same instructions and the same format as the group who
answered under the I format.

The increased size of the correlations between the scales of the 
EPI and the SD scale follows from work done by Edwards (1959). 
In this work, Edwards found that scores on personality scales were 
more predictable using the SD scale when the subject was describ- 
ing someone he liked very much than when the subject described 
himself. Assuming the people who know you well are also those who 
like you very much, we have much the same conditions which 
Edwards found to be most predictable using the SD scale. The main 
difference is that the person being described is answering for his
friend rather than actually having the friend providing the de- 
scription. 

In general, then, the scales of the EPI have both a favorable and
an unfavorable psychometric property which is related to the
instructions and format of the inventory. The favorable
characteristic is the increase in individual differences. The unfavorable
characteristic is the increased correlations with the SD scale. In
inventories where the overall level of the correlations with the SD scale
is as low as in the EPI, the increase attributable to the
instructions may be offset by the favorable feature, but in most
inventories, with the substantial proportion of variance accounted for
by the SD scale, the added burden of the He instructions would
appear unwise.
THE CONTRIBUTIONS OF QUESTIONNAIRE
LENGTH, FORMAT, AND TYPE OF SCORE TO
RESPONSE INCONSISTENCY


MERLE E. ACE AND RENE V. DAWIS¹


Industrial Relations Center 
University of Minnesota 


In methods of scaling based on the law of comparative judg- 
ment (Guilford, 1954; Torgerson, 1958), it is implicitly assumed 
that subjects make logically consistent choices between/among 
stimuli. In practice, choices are observed which take the form of 
circular triads (the choice of A over B, B over C, and C over A).
Such circular triads may be taken as evidence of psychological
equivalence among the stimuli. They may also reflect unreliability
in the scaling (i.e., in the instrument) and/or in the subjects
(judges). As such, circular triad scores should be related to mea- 
surement error. Support for the latter expectation is to be found 
in recent studies by Weksel and Ward (1967) and Hendel and 
Weiss (1968), which show that circular triad scores are negatively 
related to internal consistency and test-retest reliabilities. 
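
A circular triad score can be computed directly from a subject's pair-comparison choices. The sketch below counts the triads by inspecting every set of three stimuli, and checks the count against the closed-form expression usually attributed to Kendall and Babington Smith, d = n(n-1)(2n-1)/12 - (1/2)Σa_i², where a_i is the number of times stimulus i was chosen. The formula is not given in the article and the choice matrix is hypothetical, so this is an illustration rather than the authors' scoring procedure.

```python
from itertools import combinations

def circular_triads_brute(pref):
    """Count circular triads by checking every set of three stimuli.
    pref[i][j] is 1 if stimulus i was chosen over stimulus j, else 0."""
    count = 0
    for i, j, k in combinations(range(len(pref)), 3):
        cycle_one = pref[i][j] and pref[j][k] and pref[k][i]
        cycle_two = pref[j][i] and pref[k][j] and pref[i][k]
        if cycle_one or cycle_two:
            count += 1
    return count

def circular_triads_formula(pref):
    """Closed-form count from the number of times each stimulus was chosen
    (assumes a complete set of paired comparisons with no ties)."""
    n = len(pref)
    sum_sq = sum(sum(row) ** 2 for row in pref)
    return (n * (n - 1) * (2 * n - 1) - 6 * sum_sq) // 12

# Hypothetical choices among four stimuli A, B, C, D (NOT study data):
# A over B, B over C, C over A (one circular triad), and all three over D.
pref = [
    [0, 1, 0, 1],   # A
    [0, 0, 1, 1],   # B
    [1, 0, 0, 1],   # C
    [0, 0, 0, 0],   # D
]
print(circular_triads_brute(pref), circular_triads_formula(pref))   # 1 1
```

A fully consistent (transitive) set of choices yields a score of zero under both functions.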

Circular triads have also been interpreted as manifestations of a 
personality trait, sometimes identified as response inconsistency. 
Such an interpretation was adyocated by Gulliksen, Saunders and 
Tucker (1954), who reported that in two groups of college students 
at different universities, a curvilinear relationship was found 
between number of circular triads obtained on a pair-comparison 
instrument, and average grades. The same interpretation was 
utilized by Pemberton (1966) in a study of the work preferences of 


1 The senior author is now at the University of British Columbia. The au- 
thors are indebted to Marvin D. Dunnette and Dwight R. Kauppi for their 
evaluations of earlier drafts of this manuscript. 
100 professional, technical, and clerical employees. He found that 
circular triads were made more frequently by persons who pre- 
ferred well-structured work environments, externally imposed sys- 
tem and order, and concrete rather than abstract tasks. 

To demonstrate that response inconsistency is a measurable per- 
sonality trait, it must be shown (among other things) that response 
inconsistency scores are not artifacts of the psychometric instru- 
ment itself. Three characteristics of the instrument which might 
contribute to variance in response inconsistency scores are length 
of the instrument, item format, and type of (inconsistency) score. 
If response inconsistency were indeed a measurable personality 
trait, then one would expect that: 


a. the proportion of variance contributed to inconsistency by 
instrument length, item format and type of score would be small; 

b. the proportion of variance contributed to inconsistency by 
individual differences would be large; and 

c. the relationship between different types of inconsistency scores 
would be uniformly high. 


The present study was conducted to investigate the influence of 
these three instrument characteristics (length, format and score) 
on response inconsistency scores. 


Procedure 


Two forced-choice preference inventories were constructed from 
statements contained in the Minnesota Importance Questionnaire 
(Weiss, Dawis, England and Lofquist, 1964). These statements 
were the following: 


1. I could do something that makes use of my abilities. 
2. The job could give me a feeling of accomplishment. 
3. I could be busy all the time. 
4. The job would provide an opportunity for advancement. 
5. I could tell people what to do. 
6. The company would administer its policies fairly. 
7. My pay would compare well with that of other workers. 
8. My co-workers would be easy to make friends with. 
9. I could try out some of my own ideas. 
10. I could work alone on the job. 
11. I could do the work without feeling that it is morally wrong. 
12. I could get recognition for the work I do. 
13. I could make decisions on my own. 


These statements are from the domain of work attitudes and have 
been used in measures of vocational needs. Inventory 1 consisted 
of all 13 statements. A subset of these, the first seven statements, 
constituted Inventory 2. The numbers of statements (7 and 13) 
were chosen from plans for incomplete block designs for use in con- 
structing multiple rank order instruments such that no two state- 
ments would appear together in any ranking block more than 
once (Cochran and Cox, 1957). 

Each inventory was presented in two formats. The first was a 
complete pair comparison format, each pair being presented twice, 
once in the first half of the instrument, once in the second half, in 
counterbalanced or reversed (AB-BA) pair sequence. The second 
was a multiple rank order format using triads, the triads being pre- 
sented twice, once in the first half of the instrument, once in the 
second half with the statements in reverse (CBA) sequence. The 
reversed sequence feature was necessary to allow the derivation of 
three different types of inconsistency scores, as follows: 

The data from the multiple rank order format instruments were 
decomposed to pair comparisons data, making them comparable to 
the information from the other (pair comparisons) format 
instruments. For each instrument’s data, the pair comparisons were then 
organized in n X n matrix form (where n stood for the number of 
inventory statements). The column headings represented statements 
appearing first in the pair or triad, while the row headings represented 
statements appearing last in the pair or triad. Thus, the upper triangle 
of the matrix represented the first half (or the AB or ABC se- 
quence) of the inventories, while the lower triangle represented 
the second half (the BA or CBA sequence). 

Three inconsistency scores were derived from these matrices. 
Score 1 was the number of circular triads appearing in the upper 
triangle; Score 2 was the number of circular triads in the lower 
triangle; and Score 3 was the number of inconsistent responses made 
to the same pair of statements when upper and lower triangles were 
compared. 
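
The three scores can be illustrated with a short sketch. The Python below is an illustration only, not the authors' computing procedure; the function names and the boolean-matrix representation (prefers[i][j] is True when statement i was chosen over statement j) are assumptions introduced here.

```python
from itertools import combinations


def count_circular_triads(prefers):
    """Count circular triads in one complete set of pair comparisons.

    prefers[i][j] is True when statement i was chosen over statement j
    (a hypothetical representation, not the layout used in the study).
    """
    n = len(prefers)
    count = 0
    for i, j, k in combinations(range(n), 3):
        # A triad is circular when its three choices form a cycle,
        # in either direction.
        if prefers[i][j] and prefers[j][k] and prefers[k][i]:
            count += 1
        elif prefers[j][i] and prefers[k][j] and prefers[i][k]:
            count += 1
    return count


def count_reversals(first_half, second_half):
    """Count pairs answered differently in the two halves (Score 3)."""
    n = len(first_half)
    return sum(
        first_half[i][j] != second_half[i][j]
        for i, j in combinations(range(n), 2)
    )


# A over B, B over C, C over A: one circular triad.
cyclic = [[False, True, False],
          [False, False, True],
          [True, False, False]]

# A over B, B over C, A over C: a consistent (transitive) ordering.
transitive = [[False, True, True],
              [False, False, True],
              [False, False, False]]

print(count_circular_triads(cyclic))        # 1
print(count_circular_triads(transitive))    # 0
print(count_reversals(transitive, cyclic))  # 1 (only the A-C pair differs)
```

Scores 1 and 2 correspond to calling count_circular_triads on the first-half and second-half matrices respectively; Score 3 corresponds to count_reversals.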

All inconsistency scores were standardized by converting them 
to the corresponding proportion of highest score possible. The 
formula for the maximum number of circular triads, where the 
number of statements is odd, is given by Gulliksen and Gulliksen 
(1966) as 


d_{max} (n odd) = n(n^2 - 1)/24 


Thus, for the 7-statement inventory, the maximum number of 
circular triads is 


7(7^2 - 1)/24 = 14 


For the 13-statement inventory, the maximum number of circular 
triads is 91. The highest possible score when comparing the same 
pairs in the upper and lower triangles is equal to the number of 
items in half of the inventory, that is, 21 for the 7-statement 
inventory and 78 for the 13-statement inventory. 
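
The maxima quoted above follow directly from the formula and from the number of distinct pairs. A minimal sketch, with hypothetical function names, that reproduces them and expresses a raw score as a proportion of the highest possible score:

```python
def max_circular_triads(n):
    """Maximum number of circular triads for an odd number of statements."""
    if n % 2 == 0:
        raise ValueError("this formula applies only when n is odd")
    return n * (n * n - 1) // 24


def number_of_pairs(n):
    """Number of distinct pairs among n statements (highest Score 3)."""
    return n * (n - 1) // 2


def standardize(raw_score, highest_possible):
    """Express a raw inconsistency score as a proportion of its maximum."""
    return raw_score / highest_possible


for n in (7, 13):
    print(n, max_circular_triads(n), number_of_pairs(n))
# 7 14 21
# 13 91 78
```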

A total of 177 college sophomores participated as subjects in the 
study. Because the subjects were drawn from a college population, 
failure to understand instructions was not felt to be a problem. 
(The MIQ was designed for use with rank-and-file employees.) To 
minimize apathy or inattentiveness, individual score results were 
offered to all subjects, in addition to the extra points toward a final 
grade normally given for such participation in research. Knowl- 
edge of results was presumed to help minimize the possibility of 
deliberately invalid response. 


Results 


One hundred fifty-six subjects were assigned randomly to the 12 
conditions of a 2 X 2 X 3 completely crossed ANOVA design, with 
13 observations per condition. The three factors, with their corre- 
sponding levels, were: (a) questionnaire length (7 vs. 13 state- 
ments), (b) item format (pair comparison vs. multiple rank order 
triad), and (c) type of scoring (Score 1 vs. Score 2 vs. Score 3 as 
described above). The remaining 21 subjects were utilized in a one- 
way repeated measurements design, across scores. 

Table 1 summarizes the results of the three-way analysis of vari- 
ance. As shown in Table 1, the analysis yielded significant F tests 
at the .01 level on each of the three factors, that is, Length, 
Format, and Score. F tests were also significant at the .01 level for the 
Format X Score interaction, and for the three-way interaction. 


TABLE 1 
Summary Table for 2 X 2 X 3 ANOVA with Inconsistency Scores 
as the Dependent Variable 

Source             SS      df       MS        F      ω² 

Length (L)       476.87     1     476.87     8.37*   .02 
Format (F)      1000.71     1    1000.71    17.57*   .05 
Score (S)       5580.12     2    2790.06    48.97*   .31 
L X F             20.81     1      20.81      .37     0 
L X S            261.80     2     130.90     2.30     0 
F X S           1453.94     2     726.97    12.76*   .08 
L X F X S        561.97     2     280.98     4.93*   .03 
Error           8203.84   144      56.97 

Total          17560.06   155 

* Significant at the .01 level. 


Because the proportion of variance accounted for by each of the 
effects was of interest, values for omega-squared, an indicator of 
strength of association (Hays, 1963), were calculated. The vari- 
ance accounted for by Length was 2%, 5% by Format, 8% by the 
Format X Score interaction, and 3% by the three-way interaction. 
The greatest proportion of variance, 31%, was accounted for by 
the Score factor. This was almost twice as great as the other ef- 
fects combined. 
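
The omega-squared values in Table 1 can be checked from the sums of squares. The sketch below assumes the usual fixed-effects estimate, (SS_effect - df_effect x MS_error) / (SS_total + MS_error), with negative estimates set to zero; the source cites Hays (1963) but does not print the formula, so this form and the function name are assumptions.

```python
def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Fixed-effects omega-squared estimate (assumed form, see lead-in)."""
    estimate = (ss_effect - df_effect * ms_error) / (ss_total + ms_error)
    # Negative estimates are conventionally reported as zero.
    return max(estimate, 0.0)


MS_ERROR = 56.97
SS_TOTAL = 17560.06

# (sum of squares, degrees of freedom) taken from Table 1
effects = {
    "Length": (476.87, 1),
    "Format": (1000.71, 1),
    "Score": (5580.12, 2),
    "F X S": (1453.94, 2),
    "L X F X S": (561.97, 2),
}

for name, (ss, df) in effects.items():
    print(name, round(omega_squared(ss, df, MS_ERROR, SS_TOTAL), 2))
```

Under this assumption the five significant effects come out at .02, .05, .31, .08, and .03, matching the tabled values.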

Results of the one-way analysis of variance are summarized in 
Table 2. The F tests on both the Between and Within Score effects 
were found to be significant at the .01 level. Calculation of 
omega-squared values showed that the Between effect, that is, individual 
differences, accounted for 30% of the variance, and the Within 
Score effect accounted for 25% of the variance. 

TABLE 2 
Summary Table for the Repeated Measurement Design with Inconsistency Scores 
as the Dependent Variable 

Source             SS      df       MS        F      ω² 

Between         3051.07    20     152.55     3.09*   .30 
Within          3783.03    42 
  Score         1808.76     2     904.38    18.32*   .25 
  Residual      1974.27    40      49.36 

Total           6834.10    62 

* Significant at the .01 level. 


A final analysis conducted was the calculation of Pearson 
product-moment correlations among inconsistency scores within each 
of the four groups, that is, subjects categorized according to the 
two formats within the two inventory lengths. The results are 
shown in Table 3. The correlation coefficients ranged from -.06 to 
.69 and, with the exception of two, were significant at the .05 level. 
The r's between circular triads in the lower triangle of the 7- 
statement inventory and the two other scores of inconsistency 
were zero. This particular score had a mean of .13 and a variance 
of .16. Thus, it appears that the range for this score was so severely 
restricted that possible relationships with other variables would 
fail to appear. 


Discussion 


The significant effects from the three-way analysis of variance 
indicate that when one speaks of inconsistency, care should be 
taken to specify how many items were used, what the format of 
the items was, but especially, the manner in which the inconsist- 
ency score was derived. The cell means given in Table 4 show that 
subjects generally exhibited greater inconsistency on the longer in- 
ventory, the triad format, and the score obtained by a comparison of 
the upper and lower response matrix triangles. This did not hold 
true in every case. There were complex interactions between the 
factors studied. However, the proportion of variance accounted for by 
the factors and factor interactions is relatively, and practically, 
small, except for the Score factor. This factor turned out to be the 
most important of the instrument characteristics studied, 
explaining 31% of inconsistency score variance in the three-way design. 


TABLE 3 
Correlations of Inconsistency Scores 

                                         Circular Triads   Circular Triads 
                                         Upper Triangle    Lower Triangle 

13 Stem Pair Comparison (N = 43) 
  Circular Triads (Lower Triangle)            .40* 
  Upper & Lower Triangle Comparison           .67**             .69** 

13 Stem Triad (N = 46) 
  Circular Triads (Lower Triangle)            .30* 
  Upper & Lower Triangle Comparison           .66**             .46** 

7 Stem Pair Comparison (N = 45) 
  Circular Triads (Lower Triangle)            .55** 
  Upper & Lower Triangle Comparison           .68**             .48** 

7 Stem Triad (N = 45) 
  Circular Triads (Lower Triangle) a         -.06 
  Upper & Lower Triangle Comparison         [illegible]         .09 

a With a mean inconsistency score of .13, the range was so severely restricted in this measure that 
possible relationships could not be determined. 

* Significant at the .05 level. 

** Significant at the .01 level. 


TABLE 4 
Cell Means for 2 X 2 X 3 ANOVA with Inconsistency Scores 
as the Dependent Variable 

                                   Format 
                          Pair Comparison    Triad 

13-Statement Inventory 
  Upper Triangle                9.47          7.10 
  Lower Triangle                5.75          9.72 
  Comparison                   14.69         30.47 

7-Statement Inventory 
  Upper Triangle                4.95         11.54 
  Lower Triangle                6.04          1.10 
  Comparison                   10.62         21.98 

There was an additional factor in the first analysis which has not 
been mentioned. Both questionnaire length and item format imply 
a time element: the larger the number of items, the longer the time 
of administration. Administration time is also longer with a pair 
comparison format than with a multiple rank order format. Thus, 
time is nested within the Length factor and the Format factor in 
the first analysis of variance. 

In the repeated measurements experiment, the Score factor was 
again found to be significant. However, individual differences ac- 
counted for a greater proportion of the variance (30%) than did 
the Score factor (25%). 

Intercorrelations among the inconsistency scores were mostly 
positive and significantly different from zero. They were not as 
large as would be expected in terms of the Campbell and Fiske 
(1959) convergent validity scheme. It is obvious from these corre- 
lations that there is a great deal of method variance in the mea- 
sures of inconsistency utilized in this study. This finding reflects 
the large amount of variance accounted for by the Score factor in the 
two experiments. 

Additionally, the finding that circular triad scores obtained from 
the upper triangles (first halves of the instruments) correlate no 
better than .55 with the same type scores obtained from the lower 
triangles (second halves) casts grave doubt on the reliability (sta- 
bility) of this type of score as a measure of a presumed personality 
trait of response inconsistency (or consistency). It is apparent 
from the data reported here that the circular triad score is not a 
stable score. 


Summary 


An experiment was conducted to determine the influence of ques- 
tionnaire length (7 vs. 13 items), format (pair comparison vs. 
triad), and type of score (upper triangle circular triad vs. lower 
triangle circular triad vs. upper-lower triangle inconsistency) on 
response inconsistency (expressed as a ratio of observed frequency 
to highest possible frequency). A three-way analysis of variance 
showed that all three factors were significant, as were the Format X 
Score interaction and the three-way interaction. With the excep- 
tion of the Score factor, however, none of the factors or interaction 
terms accounted for more than 8% of the dependent variable vari- 
ance. The Score factor accounted for 31% of the variance. A one- 
way repeated-measures-design experiment confirmed the salience 
of the Score factor, with the within-score effect accounting for 
25% of the variance, while individual differences accounted for 
30% of the variance. Correlations among the inconsistency scores 
ranged from —.06 to .69, with a median of .48, further confirming 
the contribution of type of score to variability in the measurement 
of response inconsistency. 
EXPECTED GRADE IN A COURSE, GRADE POINT 
AVERAGE, AND STUDENT RATINGS OF THE 
COURSE AND THE INSTRUCTOR 


R. BARKER BAUSELL AND JON MAGOON 
University of Delaware 


STUDENT-INITIATED course evaluations have become institutional 
fixtures on a large number of college and university campuses in 
recent years. As long as the results from these evaluations are used 
solely by students for their avowed selective and expressive pur- 
poses, then, like the results of a public election, they are as valid 
as they are representative of the opinions of the student population. 
Due to the increased use of the results from course evaluations by 
other constituents of the academic community, however, the ques- 
tion of validity is not so easily dismissed. 

Many faculty regard student evaluations of their courses as an 
indication of their teaching success, and may actually allow the 
results to shape their subsequent pedagogical behavior. There is 
reason to believe that administrators are increasingly using the re- 
sults of course evaluations as an operational measure of teaching 
effectiveness (Pierrel, 1968), usually one of several criteria for 
faculty promotion. In these instances, a determination of those 
factors which influence ratings, but are at odds with (or extraneous 
to) the instructor's purposes, is crucial. 

Researchers have long suspected that a student’s performance 
in a course, as well as his general academic ability, may bias his 
rating of that course and its instructor. There is, however, a degree 
of ambiguity in the findings to date. Remmers (1930), one of the 
first to investigate the issue, found a point-biserial correlation of 
only 0.07 between grade received and student rating of instructors. 
He later found no relationship even when controlling for scholastic 
aptitude (1947). Bendig (1953), however, criticized this later study 
because of the low correlation between the aptitude measure and 
the final grade. Using the more powerful tool of analysis of vari- 
ance, Bendig found a slight relationship between level of course 
achievement (grade) and course ratings; however, no relationship 
was observed between achievement and instructor ratings. 
Similarly, Russell and Bendig (1953) found that overachievers (Ss 
receiving a higher grade than their scholastic aptitude "predicted") 
rated the course generally higher than underachievers, although the 
two groups differed only sporadically on instructor ratings. Ani- 
keeff (1953) on the other hand, using an index of grading leniency, 
and utilizing a large sample (39 instructors rated by 1500 stu- 
dents) found that as much as 25% of the variance in instructor 
ratings was accounted for by the grades which their students re- 
ceived. Blum (1936) found no relationship at all between grade 
received and instructor ratings. 

All the research reviewed above, however, investigated the rela- 
tionship between the grade the student received and his ratings of 
the instructor and the course. One obvious problem with this 
approach is that at the time of his rating the student may have 
expected a grade other than the one he received. Blum (1936) found 
that 22% of the Ss in an undergraduate psychology course expect- 
ing "A's" actually received a lower grade, while those receiving 
“D’s” exhibited a marked tendency to overestimate. The grading 
procedure in this study was rigidly defined, hence the correlation 
between expected grade and obtained grade is probably an 
overestimate with respect to the universe of college courses. This 
tendency to overestimate the grade which a subject actually receives 
would, of course, militate against finding a relationship between 
received grade and ratings. 

Blum also investigated the effect of expected grade on instructor 
ratings in the same study, again finding no connection. However, 
the rating scale used was extremely limited: each student could 
only rate the instructor as either deserving of an “A,” “B,” “C,” 
“D,” “E,” or “F.” Garverick and Carter (1962) also found no 
relationship between expected grade and instructor ratings; however, 
both the sample (164 students taking the same course from the 
same instructor) and the analysis employed (cluster analysis) may 
have been inappropriate to test the hypothesis. Weaver (1960) 
employing a more diverse sample found that Ss expecting higher 
grades tended to rate the instructor’s teaching techniques as more 
effective, although no difference was found in respect to items re- 
ferring to the instructor's personality. 

The purpose of the present study was to resolve the ambiguity 
generated by the diverse findings above by applying more 
parsimonious (multiple-group discriminant analysis) and more 
powerful (multivariate analysis of variance) statistical procedures to the 
following questions: (1) Is there a relationship between a student's 
expected grade in a course and his ratings of that course and its 
instructor? (2) Is there a relationship between general academic 
achievement (grade point average) and instructor-course evalua- 
tions? (3) Are discrepancies between a student's general academic 
achievement and his expected grade in a course reflected in his 
ratings of the course and its instructor? The statistical procedures 
used closely follow those outlined by Cooley and Lohnes (1962). 


Sample and Procedures 


Over 17,000 individual ratings of instructors were obtained in 
university-wide course evaluations in the fall semester of 1969 at 
the University of Delaware for those instructors who allowed the 
student-run evaluations in their classrooms. Of these ratings 31% 
were found incomplete, and were dropped from further considera- 
tion, leaving approximately 12,000 complete ratings from which 
samples were drawn for this study. Each of the rating forms con- 
tained information about the rater, including the expected grade 
(EG) of the rater and the rater's grade point average (GPA). 

The approximate distribution of EG across all courses 
represented in the ratings available for analysis was as follows: 13% 
of the students expected A's, 40% of the students expected B's, 
36% of the students expected C's, and 5% of the students expected 
D's. These proportions of the EG would be expected to differ some- 
what from those within classrooms, for review of 35 randomly se- 
lected classrooms revealed frequent distributional differences. In 
no case were the classroom EG distributions so skewed that fewer 
than three types of grades were expected. A sample with an N of 
500 was randomly selected from the pool of usable data to repre- 
sent each of the four “expected grade” categories. 

The GPA represents a cumulative average of “quality points” 
over all courses the student has taken, where A = 4, B = 3, C = 
2, D = 1. To some extent this is looked upon as an index of success 
in college. The average GPA is known to increase slightly for the 
classes as they pass from freshman to senior year. Freshmen were 
not included in the GPA samples, since the instrument was given in 
the fall before they had obtained any quality points. Five cate- 
gories for the GPA were utilized, where the total percentage of stu- 
dents in each was approximately as follows: 1.0 to 1.4, 4%; 1.5 to 
1.9, 8%; 2.0 to 2.4, 35%; 2.5 to 2.9, 30%; 3.0 to 4.0, 23%. A sample 
with an N of 450 was randomly selected to represent each of the 
five GPA categories. 

The “discrepant expected grade" (DEG) was found by deter- 
mining roughly how much an individual's EG in the course he was 
rating differed from his GPA. Five categories were again 
established, with sample sizes as follows: two or more quality points 
below GPA, N1 = 115; one point below, N2 = 600; no difference 
from GPA, N3 = 600; one point above GPA, N4 = 600; two or 
more points above GPA, N5 = 285. 

The course evaluation instrument used was typical of many cur- 
rently used by student association groups for the express purpose 
of public instructor evaluation. This particular instrument had 
ancestral ties to the Purdue Rating Scale for Instruction (Rem- 
mers, 1960), from which many items were suggested or modified 
by student committees and their faculty advisors. The instrument 
consisted of 29 evaluative rating items (5-point bipolar scales) 
covering course workload, text quality, instructor quality, and 
course structure. The average ratings published for each course 
had a reliability dependent upon the number of raters, and as 
Remmers has pointed out (1930), the Spearman-Brown formula 
applies. The mean correlation for ratings of random pairs of raters 
within 30 randomly selected classrooms was found to be .36. When 
28 student raters are involved (the median number in the courses 
surveyed) the reliability would be estimated to be .94. The 
reliability of mean ratings for individual items would necessarily be 
low for almost any size classroom, for the instrument has effec- 
tively been shortened by a factor of nearly 30. Nevertheless, it 
seemed appropriate to analyze these data using individual rating 
items as dependent measures, for there was no a priori rationale 
for the determination of composite scores, and the study was heu- 
ristic in nature. 
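
The reliability estimate for a class average can be reproduced with the Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), where r is the mean correlation between pairs of raters and k is the number of raters. A minimal sketch (the function name is introduced here for illustration):

```python
def spearman_brown(single_rater_r, n_raters):
    """Reliability of the mean of n_raters ratings (Spearman-Brown)."""
    return (n_raters * single_rater_r) / (1 + (n_raters - 1) * single_rater_r)


# Mean inter-rater correlation of .36 and the median class size of 28 raters.
print(round(spearman_brown(0.36, 28), 2))  # 0.94
```

With a single rater (k = 1) the formula returns the inter-rater correlation itself, which is why the reliability of a mean rating depends so heavily on class size.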
Table 1 contains a condensed description of the rating items and 
the means on each item for each EG category. Mean differences on 
items between the end categories (expecting “A” and expecting 
“D”) were as large as 1.24 units on the 5-point rating scale. It 
should further be noted that almost all rating means formed a per- 
fect rank ordering with the EG. Univariate F statistics were cal- 
culated for group differences on each of the 29 rating items, and in 
each case were significant at the .001 level. The F ratio for the mul- 
tivariate test of overall group differences was 14.5, and with large 
degrees of freedom (ndf1 = 87, ndf2 = 5889) was significant at far 
beyond the .001 level. 

TABLE 1 
Rating Means and Univariate F Ratios for Expected Grade Analysis 

                                      A         B         C         D 
Rating Items                       N = 500   N = 500   N = 500   N = 500   F Ratio* 

 1. Method effectiveness            3.84      3.45      3.23      3.00       63.5 
 2. Reading load                    1.84      2.26      2.16      1.90       16.3 
 3. Relaxed atmosphere              3.99      3.64      3.48      3.24       67.0 
 4. Explicitness                    3.90      3.73      3.56      3.37       26.0 
 5. Accurate evaluation             3.96      3.53      3.27      3.04       84.4 
 6. Absences                        1.44      1.57      1.59      1.77       10.0 
 7. Outside study                   2.41      2.73      2.82      2.72       18.5 
 8. Instructor's interest           4.37      4.09      3.98      3.91       23.5 
 9. Opportunities to question       4.26      3.77      3.72      3.56       42.0 
10. Instructor's effectiveness      3.89      3.47      3.42      3.26       34.4 
11. Instructor's organization       3.89      3.69      3.60      3.47       14.8 
12. Instructor's presentation       3.92      3.61      3.48      3.15       45.0 
13. Intellectual stimulation        3.66      3.30      3.17      2.93       39.0 
14. Instructor's respect            4.23      3.75      3.66      3.52       52.0 
15. Grading fairness                4.20      3.72      3.51      3.28       85.7 
16. Course evaluation               4.02      3.56      3.35      2.86      109.0 
17. Instructor evaluation           4.20      3.73      3.62      3.89       52.8 
18. Textbook evaluation             3.53      3.40      3.17      2.91       35.8 
19. Value of lecture                4.00      3.63      3.47      3.35       29.7 
20. Value of discussion             3.57      3.16      3.13      3.06       19.4 
21. Value of assignments            3.98      3.91      3.80      3.66        9.4 
22. Relevance of course             4.00      3.70      3.25      2.76      105.0 
23. Difficulty of material          2.98      3.20      3.42      3.70      104.4 
24. Difficulty of readings          3.00      3.26      3.40      3.65       62.7 
25. Difficulty of examinations      3.15      3.59      3.83      4.08      159.1 
26. Heavy reading load              2.75      3.16      3.24      3.14       39.6 
27. Heavy work load                 2.93      3.31      3.33      3.41       44.3 
28. Emphasis on conformity          2.77      3.00      3.03      3.1        16.2 
29. Emphasis on creativity          3.03      2.69      2.60      2.57       27.2 

* All F-ratios (3, 1996) are significant at well beyond the .001 level. 


The rating means for the five GPA categories varied very little 
in comparison to expected grade differences, and never exceeded 
one third the scale unit. The GPA groups’ means on only three 
items differed enough to yield F ratios significant at the .01 level, 
while the differences on five others were significant at the .05 level. 
The overall F statistic for the test of multivariate group 
differences was 2.0, which is significant at the .01 level with ndf1 = 116, 
ndf2 = 8811. 

An examination of the means of the five DEG groups on the 29 
items revealed an almost identical pattern to the one obtained for 
the EG. Those items whose means varied linearly in the EG analy- 
sis (Table 1) also yielded perfectly linear means (with 0 excep- 
tions) for the DEG analysis. This is not particularly surprising 
since the proportion of expected grades also varied linearly for 
the discrepant groups. For example, 23.5% of Group 1 ex- 
pected to receive a “C” and 76.5% expected “D’s,” while Group 5 
(expected grade 2 or more points above GPA) contained 8% of 
its Ss expecting “B's” and 92% expecting “A's.” 

A far more interesting question was whether the Ss within these 
groups rated instructors differently than would be predicted from 
a knowledge of their EG alone. To answer this question, projected 
means for each of the five groups for each item were calculated 
using the proportions of expected grades found in each group and 
the means listed in Table 1. The means of items 2, 7, and 26 were 
dropped from the subsequent analysis since they did not yield 
linearly varying means in the expected grade analysis. All means 
in Group 3 were also dropped, since Ss expecting grades congruent 
to their GPA were not relevant to the question. Each of the 104 
means (26 items X 4 groups) obtained in the discrepant analysis 
was then compared to its corresponding projected mean. In 79 out 
of 104 cases the obtained mean was more extreme (in the direction 
of the bias) than the projected mean (Chi-Square = 27.01, p < 
.001), indicating that those individuals expecting grades lower 
than they normally received rated instructors and courses lower 
than their expected grades predicted. Conversely, individuals ex- 
pecting atypically high grades were kinder in their evaluations. 
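
The projection and the test in this paragraph can be sketched as follows. The projected mean is taken here to be the average of the Table 1 item means weighted by the proportion of each expected grade in a discrepant group. The source does not state how the chi-square was computed; a one-degree-of-freedom test of the 79:25 split against an even 52:52 split with a continuity correction reproduces the reported value, and that is the assumption made below. Function names are hypothetical.

```python
def projected_mean(proportions, grade_means):
    """Weighted average of expected-grade item means for one group."""
    return sum(p * m for p, m in zip(proportions, grade_means))


def corrected_chi_square(observed, expected):
    """Chi-square with a 0.5 continuity correction on each cell."""
    return sum(
        (abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected)
    )


# Group 1 on item 1: 23.5% expected a C (mean 3.23), 76.5% a D (mean 3.00).
print(round(projected_mean([0.235, 0.765], [3.23, 3.00]), 3))  # 3.054

# 79 of 104 obtained means more extreme than projected, versus 52 by chance.
print(round(corrected_chi_square([79, 25], [52, 52]), 2))  # 27.01
```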

Twenty-six of the univariate F ratios for DEG group differences 
were significant at the .001 level, while the rest were significant at 
the .01 level. The multivariate F ratio for overall group 
differences was 7.8, which with ndf1 = 116 and ndf2 = 8613 is 
significant beyond the .001 level. 


| 
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The statistically significant discriminant function(s) separating 
groups in each of the three cases are described by means of the dis- 
criminant function loadings in Table 2. The loadings cited all 
exceed |==.30| (smaller loadings were disregarded) and represent 
the linear correlation between individual rating item scores and 
scores on the discriminant function. These may be considered quite 
analogous to factor loadings, except that in this case the “factor” 
is the discriminant function. The entries below each discriminant 


3 function are, in order, a general variance ratio called Wilk’s 
Lambda statistic, the canonical correlation coefficient (Re) ex- 
| TABLE 2 
8 Discriminant Function Loadings for Three Grade Effects 
| Expected Grade Grade Point ^ Discrepant 
| Rating Items I п Average Expected Grade 
1. Method effectiveness „472 „415 
2. Reading load .551 
3. Relaxed atmosphere .484 .522 
4. Explicitness .312 .315 
5. Accurate evaluation .534 519 
` 6. Absences — .304 
7. Outside study .391 
8. Instructor's interest 
и: Opportunities to question .364 — .304 „439 
` 10. Instructor’s effectiveness .337 „438 
_ 11. Instruetor's organization .368 
12. Instructor's presentation .403 .440 
. 18. Intellectual stimulation .375 411 
_ M. Instructor’s respect .406 — .333 „443 
15. Grading fairness .534 -556 
16. Course evaluation „602 „650 
17. Instructor evaluation „424 „464 
„18. Textbook evaluation .360 .384 
19. Value of lecture .322 
20. Value of discussion — .374 .307 
21. Value of assignments 
22. Relevance of course .591 .615 
23. Difficulty of material —.593 —.585 
24. Difficulty of readings —.478 —.495 
25. Difficulty of examinations —.701 —.678 
26. Heavy reading load .581 — .323 
27. Heavy work load —.360 .377 — .391 
28. Emphasis оп conformity 
29. Emphasis on creativity —.315 
Wilk's Lambda .56 .91 .90 .67 
Canonical В, .62 .28 .25 .535 
Chi Square 1150 189 234 868 


(р < .001) (р < .001) (р < .01) (р < .001) 
=. Degrees of freedom 116 т 84 145 1 
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pressing a standardized correlational relationship between an 
optimally-weighted composite of the rating items and the discrim- 
inant function, and the Chi-square statistic yielding an approxi- 
mate significance test for the canonical coefficient. It should be 
noted that Wilk’s Lambda statistic is a ratio of the generalized 
multivariate within-groups variance divided by the corresponding 
total variance, and thus provides an indirect but straightforward 
estimate of the proportion of rating variability that can be at- 
tributed to group differences. The discriminatory power of the EG 
groupings is quite substantial, for example, because 44% (i.e., 
1 - .56) of the rating variability is accounted for by expected grades. 
The statistic “1 - Wilks' Lambda” and the interpretation of dis- 
criminatory power are discussed at length by Tatsuoka (1970). 
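
The ratio just described can be sketched in a few lines. The following is only an illustration, assuming Lambda is computed as the determinant of the within-groups sums-of-squares-and-cross-products (SSCP) matrix divided by the determinant of the total SSCP matrix. The function names and the tiny two-variable data set are hypothetical and are not taken from the study.

```python
# Illustrative sketch: Wilks' Lambda = det(W) / det(T), where W is the
# pooled within-groups SSCP matrix and T is the total SSCP matrix.
# All names and data below are hypothetical.

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def sscp(rows, center):
    """Sums of squares and cross-products of rows about a center."""
    p = len(center)
    m = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - center[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                m[a][b] += d[a] * d[b]
    return m

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for i in range(n):
        pivot = max(range(i, n), key=lambda r: abs(m[r][i]))
        if abs(m[pivot][i]) < 1e-12:
            return 0.0
        if pivot != i:
            m[i], m[pivot] = m[pivot], m[i]
            result = -result
        result *= m[i][i]
        for r in range(i + 1, n):
            f = m[r][i] / m[i][i]
            for c in range(i, n):
                m[r][c] -= f * m[i][c]
    return result

def wilks_lambda(groups):
    all_rows = [r for g in groups for r in g]
    p = len(all_rows[0])
    total = sscp(all_rows, mean_vector(all_rows))
    within = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = sscp(g, mean_vector(g))
        for a in range(p):
            for b in range(p):
                within[a][b] += s[a][b]
    return det(within) / det(total)

# Hypothetical ratings on two items for three groups of three raters.
groups = [
    [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0)],
    [(4.0, 5.0), (5.0, 4.0), (6.0, 6.0)],
    [(7.0, 8.0), (8.0, 7.0), (9.0, 9.0)],
]
lam = wilks_lambda(groups)
print("Lambda:", round(lam, 3), " 1 - Lambda:", round(1 - lam, 3))
```

In the same way, the reported Lambda of .56 for the EG groupings gives 1 - .56 = .44 as the proportion of rating variability attributed to group differences.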

The EG groups were found to differ in two independent ways, 
but the first discriminant function is clearly much more important 
(R_c = .62) than the second (R_c = .28). The first discriminant 
function has the four EG centroids projected along it at fairly 
equal intervals and in their natural order, with the “expecting A” 
group at the positive extreme. Thus from the algebraic signs of the 
individual item loadings it is determined, for example, that the 
“expecting A” group rated the teaching method (item 1) the most 
effective, but rated the difficulty (item 23) lowest of any group. 
This discriminant function was composed of contributions from 
most of the more subjective items where the rater was asked to 
qualitatively rate various aspects of the course. The heaviest 
loadings were for ratings of examination difficulty, overall course 
quality, difficulty of classroom materials, and relevance of the 
course. The discriminant function can be described as a general 
course and instructor satisfaction dimension, with the emphasis 
placed more heavily on the course. 

A second discriminant function for the EG groups described a 
difference of importance between (a) a combination of the “ex- 
pecting A” and “expecting D” groups, which both had centroids 
nearly together at the negative extreme, and (b) the “expecting 
B” and “expecting C” groups whose centroids were located closely 
together at the positive end. The items loading on this function in- 
dicated that after general satisfaction is accounted for, students 
expecting a “B” or “C” grade tended to report that more reading ma- 
terial was assigned, that more outside study was undertaken, that 
the reading and workloads were heavier, and that there was less 
opportunity to question in class, less creativity, and less instruc- 
tor respect for students as individuals. This discriminant function 
reflected the perceived differences between these pairs of groups on 
a dimension probably best described as workload and interaction 
with the instructor. 

The GPA groups differed significantly along only one discrim- 
inant function, but it is clear from the Lambda statistic or canon- 
ical correlation that this was not a strong differentiation. Group 
centroids were again ranked fairly evenly along the function in 
their natural order, with Ss in the lowest GPA category at the 
negative extreme, and Ss in the highest category at the positive 
extreme. The two items which contributed most to this differ- 
entiation involved estimated absences and the value of class dis- 
cussion. Students with successively lower GPA were more often 
absent from class, but paradoxically valued the class discussion 
more highly. 

The DEG groups differed significantly along a single discrim- 
inant dimension, which again ranked group centroids in the same 
order as the means occurred in the EG analysis approximately 
equidistant from one another. The centroid of the group expecting 
a grade two or more points below the individual GPA was located 
at the negative extreme, and those expecting a grade two or more 
points above their GPA were located at the positive extreme. The 
discrimination was strong (as indicated by Wilks' Lambda or the 
canonical coefficient), and was quite similar in structure to the 
first discriminant function between EG groups. This function was 
again a general course and instructor satisfaction dimension, with 
slightly more emphasis upon the course. 

Discussion 

The results of the present study run counter to many of the re- 
search findings cited above which found no relationship between 
students’ expected grade in a course and their ratings of that course 
and its instructor. The present study found strong, consistent bi- 
ases in both instructor and course ratings which can be traced to 
(a) the grade the student expects to receive, and (b) the discrep- 
ancy between the student’s expected grade and his GPA. The rela- 
tionship between GPA and ratings alone is negligible, and should 
not be considered an important source of bias. 

There are several possible explanations for the discrepancy be- 
tween the results of the present study and previous related re- 
search. In the first place, the samples employed in the present study 
were drawn from a large undergraduate student population rather 
than from a few classes taught by an even more limited number of 
instructors. Secondly, the ratings used in this study were coordi- 
nated by a student government committee for the express purpose 
of publicizing students’ evaluations of courses and instructors. 
There exists a distinct possibility that student raters wished by 
their ratings to reflect their own subjective perception of the course 
and instructor as a guide for future students’ course and instructor 
choices. Indeed, this was one of the rationales for the campus- 
wide promulgation of the results of the evaluation. There is no way 
to determine whether the observed biases were conscious or not. 
Motives, from an operational point of view, have little relevance 
in the present context. Students’ varying amounts of success in the 
educational enterprise are reflected in their ratings. If an instructor 
or his administrators value high ratings, then low ratings attribu- 
table to the reported bias are retributive. 

The final contributor to the disparity between the present and 
past findings may be traceable to the rating instrument itself, 
which differs markedly from instruments used in past studies in 
both the number and range of items. However, the instrument used 
in the present study does not differ substantively from many such 
instruments in current use, and, due to the size and reliability of 
the observed bias, similar results would have to be predicted for 
other instruments used in a similar context. 

Student ratings of instructors are by their very nature axio- 
matically valid for their designed purpose, but must be interpreted 
with caution. Some instructors with low ratings may have excellent 
justification for the assignment of low grades; administrators using 
the results of such ratings for promotional and other criteria (un- 
related to the original purpose of the instrument) should take 
cognizance of this factor. It is an empirical question whether 
greater clarity in the justification of the grades assigned in a course 
would make the grading system appear less arbitrary to students, 
and hence result in less retributive instructor ratings. 
ON “ESTIMATES OF COEFFICIENT ALPHA FOR 
FINITE POPULATIONS OF ITEMS” 


KEN SIROTNIK 
University of California, Los Angeles 


A warning is in order for the readers of “Estimates of Coefficient 
Alpha for Finite Populations of Items” (Sirotnik, 1972): do not 
use any of the finite formulas if you are interested in computing 
meaningful indices of internal consistency or measurement error. 
Defining true score as the number of items correct in a finite popu- 
lation of items provides nothing useful for the practitioner. (Try 
to interpret formula 3 when m = M!) The poor teacher in the ex- 
ample given towards the end of the article erred miserably in his 
estimation of α for his 60-item population. Had he been wiser, he 
would have computed α for his 30-item sample using the usual 
estimation formula α = (MS_E − MS_IE)/MS_E, obtaining .61. He 
would have then augmented the coefficient using the usual 
Spearman-Brown formula for a test twice as long, obtaining .76. 
In general, denoting the estimates of alpha for the m-item sample 
and M-item population α_m and α_M, respectively, 

α_M = Mα_m/[m + (M − m)α_m]. 
The teacher should also be advised to consult Shoemaker (1972) 
for procedures and estimation formulas for α in multiple matrix 
(examinee-item) sampling designs. 
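
As a check on the arithmetic, the general step-up formula above can be written as a short function. This is a minimal sketch; the function name is an assumption introduced here for illustration, and the inputs are the values quoted in the note (a sample coefficient of .61 for m = 30 items, stepped up to M = 60 items).

```python
def alpha_for_population(alpha_m, m, M):
    """Step up alpha from an m-item sample to an M-item population.

    Equivalent to the Spearman-Brown formula with lengthening factor M / m.
    """
    return M * alpha_m / (m + (M - m) * alpha_m)

# Values quoted in the note: alpha = .61 for 30 items, population of 60 items.
stepped_up = alpha_for_population(0.61, 30, 60)
print(round(stepped_up, 2))  # 0.76

# Same result from the familiar Spearman-Brown form for a test twice as long.
k = 60 / 30
spearman_brown = k * 0.61 / (1 + (k - 1) * 0.61)
```

When m = M the function returns the sample coefficient unchanged, as one would expect of a formula for lengthening a test.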


REFERENCE 
Shoemaker, D. M. A FORTRAN IV program for estimating par- 
ameters in multiple matrix sampling with standard errors of 
estimate approximated by the jackknife. Southwest Regional 
Laboratory, Technical Note, 1972. 
1 The author is indebted to Dr. David Shoemaker for pointing out this con- 
ceptual difficulty as well as associated numerical discrepancies resulting from 
the application of formula 3. 
CONVERGENT AND DISCRIMINANT VALIDATION OF 
THE FRENCH AND GUILFORD-ZIMMERMAN SPATIAL 
ORIENTATION AND SPATIAL VISUALIZATION 
FACTORS 


GARY D. BORICH 
University of Texas 
PATRICIA M. BAUMAN 
Indiana University 


ALTHOUGH researchers generally agree that a spatial ability 
factor exists, there has been controversy concerning the nature of 
the construct and its subfactors. The existence of several spatial 
factors and instruments for their measurement have been posited 
by French (1951), French, Ekstrom, and Price (1962), and Guilford 
and Zimmerman (1956). 

After reviewing several factorial studies, French (1951) de- 
scribed two spatial factors: spatial orientation and spatial visuali- 
zation. French defined spatial orientation as the aptitude to re- 
main unconfused by the changing orientations in which a spatial 
configuration may be presented and spatial visualization as the 
aptitude to comprehend imaginary movement in three-dimensional 
space. 

French et al. (1962) selected two tests for the measurement of 
these constructs. The spatial orientation test requires the com- 
parison of two cubical blocks. The respondent is asked to indicate 
whether the two blocks are the same or different according to 
symbols written on their faces. 

The French visualization test requires an examinee to imagine 
the folding and unfolding of a piece of paper which, when folded, 
has been perforated (simulated by circles drawn on the paper) 
one or more times. Out of five alternatives an examinee must 
choose the alternative which represents the paper after it has been 
unfolded and the perforations have been made. 

Guilford and Zimmerman (1956) postulated two aptitudes which 
they also called spatial orientation and spatial visualization. Two 
tests of the Guilford-Zimmerman Aptitude Survey were designed 
to measure these constructs. The authors referred to spatial 
orientation as an ability to appreciate spatial relations with ref- 
erence to the body of the observer. The awareness of whether 
one object is to the right or left, higher or lower, or nearer or 
farther than another is the essential nature of their factor. 

The Guilford-Zimmerman test for spatial orientation requires 
an examinee to imagine that he is riding in a boat whose prow 
is always visible in the foreground of the pictures comprising 
each item. In the first picture one sees the prow of a boat and 
some portion of the seascape in front of the boat. In the second 
picture the boat has changed its position. The examinee is asked 
to compare pictures to determine the boat’s new heading prior to 
marking one of five alternatives. 

Guilford and Zimmerman described spatial visualization as a 
process of imagining movements, transformations, or other changes 
in visual objects. The Guilford-Zimmerman test for spatial visuali- 
zation consists of a picture of an alarm clock and a sphere with 
directional arrows. The respondent is asked to visualize the rota- 
tion of the clock as it is moved into different positions according 
to the directions of the arrows. One out of every five choices 
pictures the clock in its final position. 

French et al. (1962) and Guilford and Zimmerman (1956) 
posited two sets of traits which are generally equivalent to each 
other. These traits and the meanings ascribed to them by their 
authors are: 


French et al.: 

Spatial orientation (SO) 
    definition: remaining unconfused by changing orientation. 
    task: determine the similarity or difference in cubical blocks 
    from symbols on their faces. 

Spatial visualization (SV) 
    definition: comprehending imaginary movement in 
    three-dimensional space. 
    task: follow movement of paper with holes from folded to 
    unfolded position. 


Guilford-Zimmerman: 

Spatial orientation (SO) 
    definition: awareness that one object is higher or lower, left 
    or right, nearer or farther than another. 
    task: determining a boat’s position from changing seascape. 

Spatial visualization (SV) 
    definition: the process of imagining movements, 
    transformations, or other changes in visual objects. 
    task: follow movement of an alarm clock from directional arrows. 


The multitrait-multimethod matrix (Campbell and Fiske, 1959) 
is a technique for examining convergent and discriminant validity, 
prerequisite to the utility of traits and the tests used to measure 
them. Convergent validity is a confirmation of traits by indepen- 
dent measurement methods that requires a significant correlation 
between two different methods measuring the same trait. Dis- 
criminant validity requires that the correlation between different 
methods measuring the same trait exceed (a) the correlations ob- 
tained between that trait and any other trait not having method 
in common and (b) the correlations between different traits which 
happen to employ the same method. Variance among test scores 
can be due to method and/or trait factors. The multitrait- 
multimethod matrix presents all the intercorrelations which result 
when selected traits are measured by two or more methods. 
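
The two discriminant requirements can be made concrete with a small sketch. It is a simplified, global version of the Campbell and Fiske comparisons (the lowest validity-diagonal value is compared with the highest value of each other kind), not the row-by-row inspection they describe. The function names are assumptions, and the correlations are hypothetical placeholders chosen only for illustration; they are not the values of Table 1.

```python
# Each measure is a (method, trait) pair; every correlation falls into one of
# three kinds depending on whether method and trait are shared.

def classify(a, b):
    same_method = a[0] == b[0]
    same_trait = a[1] == b[1]
    if same_trait and not same_method:
        return "validity"                    # same trait, different methods
    if same_method and not same_trait:
        return "heterotrait-monomethod"      # different traits, same method
    return "heterotrait-heteromethod"        # different traits and methods

def discriminant_checks(corr):
    by_kind = {
        "validity": [],
        "heterotrait-monomethod": [],
        "heterotrait-heteromethod": [],
    }
    for (a, b), r in corr.items():
        by_kind[classify(a, b)].append(r)
    lowest_validity = min(by_kind["validity"])
    return {
        "exceeds_heterotrait_heteromethod":
            lowest_validity > max(by_kind["heterotrait-heteromethod"]),
        "exceeds_heterotrait_monomethod":
            lowest_validity > max(by_kind["heterotrait-monomethod"]),
    }

# Hypothetical correlations (NOT from Table 1); GZ = Guilford-Zimmerman,
# FR = French.
corr = {
    (("GZ", "SO"), ("GZ", "SV")): 0.70,
    (("FR", "SO"), ("FR", "SV")): 0.65,
    (("GZ", "SO"), ("FR", "SO")): 0.50,
    (("GZ", "SV"), ("FR", "SV")): 0.55,
    (("GZ", "SO"), ("FR", "SV")): 0.40,
    (("GZ", "SV"), ("FR", "SO")): 0.45,
}
checks = discriminant_checks(corr)
print(checks)
```

With these placeholder values the first requirement is met and the second is not, which is the kind of pattern that points to method variance exceeding trait variance.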

Purpose and procedure. The purpose of the present study was 
to assess the convergent and discriminant validity of the tests 
for SO and SV selected by French et al. (1962) and constructed 
by Guilford and Zimmerman (1956). Forty randomly selected 
college sophomores who had no previous knowledge of the SO and 
SV instruments were subjects for the study. The Guilford-Zim- 
merman tests and Form 1 of the French tests were administered 
in a classroom setting according to the published instructions. 
Pearson product-moment correlations were computed for the 
multitrait-multimethod matrix appearing in Table 1. 
TABLE 1 
Multitrait-Multimethod Matrix for French and Guilford-Zimmerman 
SO and SV Testsᵃ 
Guilford-Zimmerman French 
SO SV SO SV 
Guilford- 
Zimmerman so (.88)> 
БУД 


67 (:98) 
80 E (.60) 
Trench SV E 155 m» 


ᵃ p < .05 for all correlations. 
ᵇ Alternate forms reliability reported by Guilford and Zimmerman (1956). 
ᶜ Kuder-Richardson 21 reliability reported by Guilford and Zimmerman (1956). 
ᵈ Alternate forms reliability determined by the authors. 


Results and conclusions. Values in the diagonal represent the 
convergent validity data. Significant correlations between the 
French and Guilford-Zimmerman methods of measuring SO and SV 
indicate that both tests exhibit convergent validity. 

The remaining correlations comprise data for discriminant val- 
idation. Three validity coefficients outside of the diagonal, includ- 
ing both correlations between traits and within method, exceed 
values within the diagonal. The between-traits and within-method 
correlations indicate that variance attributable to the methods ex- 
ceeds variance which is attributable to the traits. 

Although the validity diagonal demonstrates convergent valid- 
ity, there is little evidence of discriminant validity. Since correla- 
tions of Guilford-Zimmerman SV with SO and French SV with 
SO exceed the validity diagonal values, the authorship of the tests 
comprises a larger contribution to the correlations than do the 
hypothesized traits. 

There is other evidence to indicate that both method and trait 
may be in common to SO and SV. For example, Roff (1952) ob- 
tained a correlation of .75 between SO and SV, a value close to 
the reliabilities of the SO and SV tests cited by Michael, Guil- 
ford, Fruchter, and Zimmerman (1957). 

Smith (1964) argued that in general authorities have yet to 
demonstrate distinctions between the two hypothesized factors. 
Smith concluded that a test which requires attention to the details 
of a configuration probably measures g, Spearman’s general in- 
tellectual factor, more than spatial ability. If Smith is correct, the 
Guilford-Zimmerman SV test might fall short of this criterion 
for a true spatial test. It is possible to complete the items of that 
test by fixating on some of the details of the alarm clock, e.g., 
the stand on which it rests or the buttons on its back, as one 
imagines the movement of the clock. The test might thus be a 
measure of g or some subfactor of g. The same observations apply 
to the French SO test, since fixating on one of the symbols of a 
block would appear to facilitate success on the test. 

There is evidence that when both SO and SV are measured with 
either the French or Guilford-Zimmerman tests, the variance 
due to authorship is greater than that due to trait. From related 
research Smith contended that SO and SV may not be distinct 
traits and suggested an additional rationale for between-trait 
and within-method correlations exceeding the validity diagonal 
values. 
THREE-CHOICE VERSUS FOUR-CHOICE ITEMS: 
IMPLICATIONS FOR RELIABILITY AND VALIDITY 
OF OBJECTIVE ACHIEVEMENT TESTS 


FRANK COSTIN 
University of Illinois at Urbana-Champaign 


In measuring psychology students’ knowledge of empirical gen- 
eralizations, Costin (1970) found that mean discrimination in- 
dices and estimates of homogeneity were slightly higher for three- 
choice items than for four-choice items. (“Discrimination” was 
estimated with the D-index [Findley, 1956] and homogeneity with 
the Kuder-Richardson Formula 20.) In view of these results and 
their relationship to previous investigations concerning the number 
of alternatives in objective achievement tests, Costin (1970) sug- 
gested that teachers in the natural and social-behavioral sciences 
who employ four-choice items would find it profitable to shift to 
three-choice items; they could then increase the efficiency with 
which they covered course content without reducing the homo- 
geneity and discriminating power of their tests and at the same 
time make the task of test construction less arduous and time- 
consuming. 

However, the findings of the study were based on a relatively 
restricted sample: 200 students in introductory psychology classes 
at Chanute Air Force Base. As a check on the results, and as a 
basis for wider generalization, the study was extended to a large 
introductory course at the Urbana-Champaign campus of the Uni- 
versity of Illinois. This course included a broad spectrum of the 
student body, since it is required in many curricula, and also is 
widely sought to fulfill general education requirements. 
Method. From an extensive pool of four-choice items, 100 were 
selected that measured empirical generalizations dealing with the 
topics covered in the last half of the course: intelligence, person- 
ality, social interaction, and behavior disorders (Costin, Dulany, 
Greenough, Lieberman, and Peterson, 1972). Items were chosen in 
proportion to the extent to which videotape presentations and 
reading assignments dealt with these topics. Half of the items 
were constructed by the investigator; the remaining items were 
drawn from a confidential test file prepared by the authors of the 
textbook used in the course (Hilgard, Atkinson, and Atkinson, 
1971). Fifty of these 100 items were then selected at random for 
reduction to a three-choice format; the reduction was accom- 
plished by discarding randomly one of the three distractors 
in each item. All items were then arranged according to topics, the 
order of the items being randomized within each topic, and were 
administered as the final examination of the course to 1566 stu- 
dents. 

For purposes of analysis, the two types of items were considered 
as constituting two different tests. In addition to calculating 
KR-20 estimates of homogeneity for each test, the means of the 
correlations between a test item and the total test score (point bi- 
serial correlation) were obtained. Since the point biserial correla- 
tion is simply a special case of the Pearson product moment corre- 
lation, Fisher’s transformation was used to calculate the means. 
In addition, the mean number of items answered correctly, the 
standard deviation, the median of the scores, and the standard error 
of measurement were calculated for each test. 
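
Averaging correlations through Fisher's transformation can be sketched as follows: each r is converted to z = artanh(r), the z values are averaged, and the mean is converted back with tanh. The function name and the three item-total correlations are hypothetical and serve only to illustrate the procedure.

```python
import math

def mean_correlation(rs):
    """Average correlations via Fisher's r-to-z transformation."""
    zs = [math.atanh(r) for r in rs]   # r -> z
    mean_z = sum(zs) / len(zs)         # average in the z metric
    return math.tanh(mean_z)           # z -> r

# Hypothetical point biserial (item-total) correlations, for illustration only.
item_total_rs = [0.20, 0.30, 0.40]
mean_r = mean_correlation(item_total_rs)
print(round(mean_r, 3))
```

For correlations of this modest size the result differs only slightly from the simple arithmetic mean; the difference grows as the correlations become larger or more variable.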

Results and conclusions. Table 1 shows the results of the item 
analyses. There were practically no differences between the various 
measures. The almost identical values of the mean point biserial 
r's are especially worth noting. Although the point biserial correla- 
tion is conventionally thought of as reflecting “discriminatory 
power,” it might be better considered as a measure of homogeneity, 
since the higher the coefficient, the greater the degree of homogeneity 
the item has with the rest of the items (Horst, 1966; Humphreys, 1970). 
Furthermore, in the instance of an achievement test the average 
point biserial coefficient could also be interpreted as reflecting in- 
ternal validity in that the total test score constitutes the criterion 
of performance. 
TABLE 1 

Item Analysis of Three-choice and Four-choice Tests of Achievement 
in Introductory Psychology 

                                          3-choice    4-choice 
Number of items                              50          50 
Number of scores                           1566        1566 
Mean number of items answered correctly     36.8        36.4 
Median score                                37.4        37.0 
S. D.                                        5.3         5.9 
KR-20                                        .75         .78 
Mean point biserial r                        .29         .30 
S. E. of measurement                        3.05        3.07 


Thus, within the limits of the subject matter and instructional 
objectives reflected by the test items, and the kinds of statistical 
analyses carried out, the results of this study confirmed the prac- 
tical benefits of three-choice items while maintaining both the re- 
liability and validity of classroom achievement tests: increase in 
the efficiency with which information can be assessed (students 
can generally complete three-choice items more rapidly than four- 
choice ones) and a less difficult and time consuming job for the 
teacher. Furthermore, although changing four-choice items to three- 
choice items by randomly discarding distractors does not seem to 
hurt a test statistically, and has the advantage of reducing reading 
time, selective abandonment of distractors would presumably help 
the test statistically. In the latter case one would still have the ad- 
vantage of time reduction, a potentially reliable test, and one of 
comparable validity provided that instructional objectives were 
appropriately sampled. 

Theoretical explanations for the fact that three-choice items did 
as good a job as four-choice items need to be determined. Contrary 
to an interpretation offered in the investigator’s previous study 
(Costin, 1970), such explanations probably lie more in the psy- 
chological than the statistical realm. For example, if one assumes 
that “chance” or “blind guessing” is an important factor in de- 
termining students’ responses to achievement test items, one would 
expect four-choice items to be relatively more difficult than three- 
choice items. However, as Table 1 shows, the mean number of 
items answered correctly was practically identical for the two 
tests. It is highly likely that students do not guess “blindly” 
when faced with a classroom test whose answers they “don’t 
know.” It is more likely that instead they choose on the basis of 
cues derived from the items themselves. Use of such cues could well 
be a greater threat to the reliability and validity of an achieve- 
ment test than would the number of alternatives per item. 

What are these cues? How do they operate to produce the kinds 
of results described in this study? To what extent is the use of cues 
in test-wise strategies a function of the number of item options? 
Further investigations are needed to answer such questions, as well 
as to determine whether the practical benefits of three-choice items 
apply to other kinds of subject matter and instructional objectives. 
Such studies should also be expanded to include comparisons of 
five-choice items and perhaps two-choice items as well, so as to as- 
certain in an empirical and a comprehensive way the relationship 
of the number of item alternatives to the reliability and ultimately 
the validity of objective-type achievement tests. 
ANSWER CHANGING ON OBJECTIVE TESTS: 
SOME IMPLICATIONS FOR TEST VALIDITY 


STANLEY S. JACOBS 
University of Pittsburgh 


The reluctance of students to change their responses to objective 
test items may appear illogical to anyone who views the test- 
taking process in a simplistic manner. Since the objective in vir- 
tually all achievement testing situations is a maximum test score, 
it would certainly seem advisable to change one's response(s) 
whenever the change(s) will contribute to that score. However, 
the decision as to when to change a response hinges on a highly 
subjective “degree of belief” in the correctness of an item option. 
This belief, which probably is the result of a highly personal 
weighting of many factors, would probably show a great deal of 
variability across an apparently homogeneous group of subjects. 

Surprisingly, the research evidence on the question of answer- 
changing is scanty and often methodologically deficient. For in- 
stance, although the belief among students (and instructors) that 
“first impressions are best” seems widespread, apparently the only 
published deliberate survey of student opinions appeared over 40 
years ago (Mathews, 1929). Although his data indicated that 
students felt answers should not be changed, an examination of 
their answer sheets revealed that answer-changing should be en- 
couraged, since the typical result was an improvement in test score. 

A number of studies on the effects of answer-changing (Lehman, 
1928; Jarrett, 1948; Reile and Briggs, 1952; Bath, 1967; Reiling 
and Taylor, 1972) have concluded that there is a relationship be- 
tween total test scores and types of changes made. That is, better 
students gain more than poorer students when answers are changed. 
One must be aware of a possible tautology, however, since the strat- 
ifying variable (total test score) has, in most cases, been simply 
the summation of item scores which were affected by the changes 
made (see, for example, Reiling and Taylor). 

To compound the problem of insufficient and/or deficient data, 
writers on the topic of test-taking behavior are divided concerning 
the advisability of answer-changing. Huff (1961), in a popular 
guide for test-takers, implied that it is usually inadvisable to 
change one’s mind. Millman, Bishop and Ebel (1965), however, 
suggested that the tendency to evaluate and judiciously to change 
one’s item responses is a basic aspect of test-wiseness. 

Previous empirical studies have usually involved ex post facto 
analyses of answer sheets for the outcomes of erased responses. 
This approach can, of course, detect only overt answer changes, i.e., 
the only trace of the decision process is the erasure. There is no 
possibility of detecting the several decisions which may precede 
putting pen to paper. Also, the lack of control over student responses 
has further compounded the interpretations of erasures. For in- 
stance, the typical study has been unable to account for individual 
differences in response strategies—the degree to which students may 
work “back and forth” among items as they are looking for and 
utilizing cues and clues from one item to answer another. The un- 
certain reliability and validity of the data sources in previous 
studies led to the development of an experimental situation where 
some control could be maintained over the responses to test items, 
and where answer changes could be readily identified. 

Method—Subjects. The sample of 50 subjects involved in the 
present study was an intact class drawn from the enrollment of 
the introductory graduate course in research methodology at the 
University of Pittsburgh. 

Procedure. In the first week of the term, all subjects completed 
the Quick Word Test, level II, form Am (QWT) (Borgatta and 
Corsini, 1964), a 100-item, 4-option multiple-choice measure of 
general mental ability. Several class meetings after the first course 
examination (dealing with elementary statistical concepts), sub- 
jects were requested to compose a brief note detailing their opinion 
concerning answer-changing on objective tests and ending with a 
statement indicating whether subjects felt the usual net result (to 
that person) was a gain or loss in test score, or whether the result 
was unknown. 
Approximately four weeks later, subjects completed a course 
examination, dealing with elementary measurement concepts, 
composed of 45 4-option multiple-choice items. The items were 
drawn from a pool of over 100 items for which item analysis data 
were available, so that the test contained 15 easy items (p = .75), 
15 items of moderate difficulty (p = .49), and 15 very difficult 
items (p = .29). All items were positively discriminating, and an 
attempt was made to maintain content validity. Items were randomly 
ordered. 

Items were reproduced singly on 2 X 2 slides. Items with a total 
word count of 25 or greater were produced as black on white slides 
and given an exposure time of 45 seconds. Items with a word count 
of less than 25 were produced as white on black slides and exposed 
for 30 seconds. A preliminary tryout with a similar group of sub- 
jects indicated the time allowances would permit at least one complete 
reading of all items. Subjects were informed of the cues and were 
advised that they would see the slides only once. They were told to 
read the items rapidly but carefully, and to answer all items. 

Upon completion of the slide-administered test, subjects were 
informed that they would have an opportunity to reconsider their 
answers. Black electrographic pencils initially used were collected, 
and subjects received mimeographed copies of the test and red 
pencils. Any changed answers were to be recorded in red without 
erasing initial responses, which allowed the determination of the 
frequency and quality of changed answers. 

Following the collection of data, subjects were thoroughly de- 
briefed; the purpose of the study and the need for deception were 
discussed. 

Analysis. A 2 x 3 x 3 three dimensional chi-square was em- 
ployed to analyze the frequency of changed answers, with the 
following dimensions: 

1. Ability: Subjects were divided at the median of the QWT scores 
into low and high ability groups. 

2. Type of change: Wrong-to-right, right-to-wrong, and wrong-to- 
wrong categories were established for changed responses. 

3. Item difficulty: Items were categorized as being of low, moder- 
ate, or high difficulty. 

A two-way ANOVA for repeated measures was employed to 
analyze net gains realized through answer-changing, as a function 
TABLE 1 
Frequency of Types of Answer Changes Made to Items of Low, 
Moderate and High Difficulty 

Type of Change      Low    Moderate    High 
Right-to-wrong       41       50        57 
Wrong-to-right      134      184        93 
Wrong-to-wrong       20       58        98 

χ² = 68.2, p < .05. 


of subject ability and level of item difficulty. Extreme groups of 
n = 15 were formed for the ability variable. 

A one-way ANOVA was employed to analyze net gains made by 
subjects previously reporting gain, loss, or no decision concerning 
their answer-changing behavior. Because of absences when the ini- 
tial reports were collected, the n for this analysis was 44 rather 
than 50. 

Results. Of the five chi-squares calculated, only two were
significant at the .05 level: the $\chi^2$ between dimensions (2) and
(3), and the total $\chi^2$. Since the other dimensions appeared
independent and the interaction $\chi^2$ was nonsignificant, the
significance of the total $\chi^2$ may be attributed to the dependence
between dimensions (2) and (3). (See Table 1.)
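
The dependence between type of change and item difficulty can be checked directly from the frequencies in Table 1. The following is a minimal sketch, assuming the reported value is an ordinary Pearson chi-square of independence on the 3 x 3 table; the function and variable names are illustrative and do not come from the original study.

```python
# Observed answer-change counts from Table 1
# (rows: type of change; columns: low, moderate, high item difficulty).
observed = {
    "right-to-wrong": [41, 50, 57],
    "wrong-to-right": [134, 184, 93],
    "wrong-to-wrong": [20, 58, 98],
}


def chi_square_independence(table):
    """Pearson chi-square of independence for a rows-by-columns table."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n  # expected count under independence
            chi2 += (obs - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df


chi2, df = chi_square_independence(observed)
print(f"chi-square = {chi2:.1f}, df = {df}")  # chi-square = 68.2, df = 4
```

Under that assumption the computed value agrees with the 68.2 printed beneath Table 1.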

As may be seen in Table 1, there was a marked tendency to
change incorrect responses to correct responses for low and
moderate difficulty items, with the quality of changes showing a
gradual deterioration as item difficulty increased. As one might
expect, fewest answers were changed for the easiest items, and the
amount gained was least for the difficult items. (See Table 2.)

As summarized in Table 3, when answers were changed there was
no significant difference in net gains attributable to subject ability, 
but there were significant differences attributable to the level of 


TABLE 2
Summary of Net Gains Resulting from Changes, for Three Levels of
Item Difficulty and Two Levels of Subject Ability

                           Level of Item Difficulty
                       Low          Moderate          High
Level of Ability    Mean  s.d.     Mean  s.d.     Mean  s.d.
High                1.0   2.12     2.47  1.81      .83  3.01
Low                 1.47  1.78     3.47  2.72     1.20  2.43


TABLE 3
Repeated Measures ANOVA Summary Table for Effects of Subject Ability
and Level of Item Difficulty on Net Gain Scores
(Conservative Test [Winer, 1962])

Source                 df      MS       F
Ability (A)             1     4.90     .89
Error (a)              28     5.48
Level of Diff. (B)      2    36.58    8.82*
A X B                   2     4.93    1.19
Error (b)              56     4.15

*p < .05.


item difficulty. The greatest gains were realized when subjects had 
changed the answers to items of moderate difficulty, the least when 
answers to very difficult items had been changed. Many students 
apparently perceived the outcomes of their answer-changing be- 
havior incorrectly. As seen in Table 4, all groups gained as a result 
of answer-changing, and the differences among groups were non- 
significant. 
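
The nonsignificant difference among the three self-report groups can be roughly re-derived from the summary statistics in Table 4. The sketch below is an illustration only: it assumes an ordinary one-way ANOVA and works from the rounded group sizes, means, and standard deviations as printed, so it cannot be expected to reproduce the published F exactly. The function name is hypothetical.

```python
def f_from_summary(groups):
    """One-way ANOVA F ratio from (n, mean, sd) summaries, one tuple per group."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# (n, mean net gain, s.d.) for the Gain, Loss, and Do Not Know groups in Table 4.
table4 = [(13, 6.0, 2.9), (20, 4.8, 3.1), (11, 5.5, 4.3)]
f_ratio, df_b, df_w = f_from_summary(table4)
print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

The result is well below 1 and of the same order as the reported F of 0.56; the remaining gap is plausibly due to rounding in the tabled values.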

Discussion. Although the data of the present study need careful 
qualification because of the unique testing procedure used, it was 
deemed of primary importance first to insure some degree of in- 
ternal validity for the study. 

The data quite clearly indicated that students should be al- 
lowed and encouraged to reconsider and evaluate their responses to 
objective test items. The improvement in test score may be great- 
est on somewhat speeded tests composed of moderately difficult 
items, probably the typical situation. 

Of some interest was the finding that general mental ability, as 
defined, was unrelated to gains made through answer-changing. If 
one attempted to develop a test orientation procedure, following 


TABLE 4
Summary of Net Gains Over Total Test for Those Subjects Reporting
a Typical Gain, Loss or No Opinion

       Gain                Loss             Do Not Know
 n    Mean  s.d.      n    Mean  s.d.      n    Mean  s.d.       F
13    6.0   2.9      20    4.8   3.1      11    5.5   4.3      0.56


_—_——— 

+Scheffé’s test (Winer, 1962), for repeated measures, showed the locus of 
the significant difference to be between moderately difficult and highly dif- 
ficult items only. 
the Millman, Bishop and Ebel approach, to allow a more valid
evaluation of naive persons (e.g. the culturally different), these 
data imply the utility of subjects’ careful evaluation of their an- 
swers. 

Relatively little is known about the strategies followed by sub- 
jects in selecting the answer to objective test items. As mentioned 
earlier, the question of whether or not to change an answer involves 
a highly subjective, personal weighing of many factors. An investi- 
gation of the decision process (e.g. by requiring verbal reports of 
the basis for assigning probabilities to item-options) may lead to a 
better understanding of the construct validity of many procedures 
generically termed tests. 
RELIABILITY OF COLLEGE GRADES AND
GRADE POINT AVERAGES: SOME IMPLICATIONS 
FOR PREDICTION OF ACADEMIC PERFORMANCE 


ALFRED E. ETAUGH, CLAIRE E. ETAUGH, AND DONALD E. HURD
Bradley University 


THE search for predictors of academic performance in college,
usually defined in terms of grade point average (GPA), has met 
with limited success. Combinations of cognitive predictor variables 
typically yield a multiple correlation with GPA ranging between 
.50 and .60, with the addition of noncognitive variables generally 
producing only a minimal gain in predictability (e.g., Chansky, 
1965; Fishman, 1962). 

The low validity coefficients obtained are attributed by some 
writers (e.g., Chansky, 1964) to the presumed low reliability of 
the GPA criterion. Conclusions regarding low reliability of the 
GPA usually are based on observations of diversity in grading 
practices, rather than on direct computation. The authors located 
only one attempt to determine the reliability of the GPA (Clark, 
1950). Unfortunately, the study is of limited value, since only se- 
lected students and courses were examined, and a possibly biased 
reliability estimate was used (Ebel, 1951). 

The equivalence between the analysis of variance (ANOVA) 
model and the standard reliability formulas, demonstrated by 
Hoyt (1941), permits the assessment of the reliability of grades 
and their averages in situations in which not all students are rated 
by the same set of graders (Ebel, 1951; Winer, 1962), a situation 
for which the standard reliability formulas are not well-suited. The
present study explored several methods of assessing the reliability
of college grades and GPAs using the ANOVA model. 
Method—Subjects. The sample consisted of all 4,288 under- 
graduate students and 215 graduate students who had completed 
two or more courses during the spring semester of the 1970-71 
school year at Bradley University. 
Procedure. The ANOVA reliability formula for averages is

    $r_{\bar{c}} = 1 - MSE/MSB$,    (1)

and for single grades is

    $r_1 = (MSB - MSE)/(MSB + (\bar{c} - 1)MSE)$,    (2)

where MSB is the mean square between students' GPAs, MSE is
the mean square error within students, and $\bar{c}$ is the average number
of courses entering into the computation of the GPA. Snedecor
(1946) recommends the following approximation to the harmonic
mean as the appropriate value for $\bar{c}$:

    $\bar{c} = [1/(N - 1)] (\sum_i c_i - \sum_i c_i^2 / \sum_i c_i)$,    (3)

where N is the number of students, and $c_i$ is the number of courses
for the ith student.
The credit-hour weighted GPA (WG) is

    $WG_i = \sum_j h_{ij} X_{ij} / h_i$,    (4)

where $X_{ij}$ is the grade for the ith student in the jth course, $h_{ij}$ is
the credit-hour weight in that course, and $h_i$ is the sum of the
credit-hours for that student.
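
The reliability formulas for averages and single grades can be sketched in a few lines. This is a minimal illustration, assuming a one-way ANOVA with students as groups and unweighted course grades as observations (every credit-hour weight set to 1); the function name and the sample grades are hypothetical and are not data from the study.

```python
def anova_reliability(grades_by_student):
    """ANOVA-based reliability of the GPA and of a single grade.

    grades_by_student: one list of course grades per student
    (unweighted case, i.e. all credit-hour weights equal to 1).
    """
    n_students = len(grades_by_student)
    counts = [len(g) for g in grades_by_student]
    total = sum(counts)
    grand_mean = sum(sum(g) for g in grades_by_student) / total
    means = [sum(g) / len(g) for g in grades_by_student]

    ss_between = sum(c * (m - grand_mean) ** 2 for c, m in zip(counts, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(grades_by_student, means) for x in g)
    msb = ss_between / (n_students - 1)   # mean square between students
    mse = ss_within / (total - n_students)  # mean square error within students

    # Formula (3): average number of courses for unequal group sizes.
    c_bar = (total - sum(c * c for c in counts) / total) / (n_students - 1)

    r_average = 1 - mse / msb                               # formula (1)
    r_single = (msb - mse) / (msb + (c_bar - 1) * mse)      # formula (2)
    return {"msb": msb, "mse": mse, "c_bar": c_bar,
            "r_average": r_average, "r_single": r_single}


# Hypothetical grades on an 8-point scale for five students.
sample = [[6, 7, 6, 8], [3, 4, 2], [5, 5, 6, 4, 5], [7, 8, 8], [2, 3, 3, 4]]
result = anova_reliability(sample)
print(result)
```

The two estimates are tied together by the Spearman-Brown relation: stepping the single-grade reliability up by the average number of courses returns the reliability of the average.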

If low reliabilities arise mainly from differing standards in vari-
ous courses, as asserted by Chansky (1964) and others, then the
substitution of the standard score ($Z_{ij}$) transform for each course
grade ($X_{ij}$) will remove the between-grader variance from the
error term and should raise the reliability.

By setting $h_{ij} = 1$ in formula (4), an unweighted GPA is ob-
tained for which the MSB and MSE depend only on the grades
assigned. The resulting reliability estimate should be higher than 
that obtained for the weighted GPA, since any extraneous vari- 
ance contributed by the credit-hour weights must end up in the 
error term. 


TABLE 1
Statistical Characteristics of the Sample

                                       School Year
               Freshman     Sophomore    Junior       Senior       Graduate
Statistic*     (N = 1220)   (N = 1032)   (N = 1006)   (N = 1030)   (N = 215)

Mean number of courses ($\bar{c}$)
WG-Mean
UG-SD          1.39

[Remaining table entries are illegible in the source.]

Note.—GPA calculated on an 8.00 system.
* Symbols are defined in text.

These considerations led to the assessment of the reliability of 
the following measures: (1) weighted GPA (WG); (2) weighted 
standard score average (WZ); (3) unweighted GPA (UG); (4) 
single grade, derived from WG (S-WG); (5) single grade, derived 
from UG (S-UG). 

Results. Table 1 summarizes some statistical characteristics of 
the sample. The means and SDs of weighted and unweighted 
GPAs are quite similar. The mean GPA rises steadily from the 
freshman to the graduate level, while the SD and the average num-
ber of courses tend to decrease.

The reliabilities of the GPAs and single grades for each school 
year are presented in Table 2. All the GPA reliabilities decrease 
over school years. The stability of the single-grade reliabilities in- 
dicates that the decrease in GPA reliabilities is mainly the result 
of the decrease in the average number of courses taken at the up-
per levels.

The fact that reliabilities of WG and WZ are essentially the


TABLE 2
Reliabilities of Grade Point Averages and Single Grades by School Year

Type of                            School Year
reliability*    Freshman   Sophomore   Junior   Senior   Graduate
WG                689         625        605      528       371
WZ                686         634        598      552       310
UG                806         786        757      726       624
S-WG              297         263        244      195       186
S-UG              443         440        396      364       410

Note.—Decimals omitted.
* Symbols are defined in text.
same thus fails to support the idea that a major source of error
variance results from differing standards among graders. The Z- 
transformation must affect both the between- and the error- 
variance proportionately; that is, not only are the biases of hard 
and easy graders eliminated, but any real differences in student 
performance in different courses are suppressed. 

As expected, reliabilities for unweighted grades and GPAs are 
higher than those for their weighted counterparts. They are also 
more consistent over school years, particularly in the case of S-UG, 
which is unaffected by the mean number of courses. This un- 
weighted single-grade reliability therefore constitutes the best es- 
timate of agreement between graders over school years. 

Discussion. There appears to be little justification for authors of 
future validity studies to speculate about the low reliability of the 
GPA criterion. The present findings show that the reliability of the 
GPA depends on three factors: the number of course grades, the 
average reliability of these grades, and the credit-hour weights. 
Adequate reliability for the GPA can almost always be assured by 
the simple expedient of including a sufficient number of courses. 
For the present data, for example, a reliability of .90 for the un- 
weighted GPA would be achieved by including an average of ap- 
proximately 14 courses, as estimated by the Spearman-Brown 
“Prophecy” formula. 
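
The Spearman-Brown estimate of the number of courses needed can be sketched as follows. The single-grade reliability of .40 used here is an assumption chosen to approximate the S-UG row of Table 2 (.36 to .44); the exact value the authors stepped up is not stated, and the function names are illustrative.

```python
def spearman_brown(r_single, k):
    """Reliability of an average of k parallel components."""
    return k * r_single / (1 + (k - 1) * r_single)


def components_needed(r_single, r_target):
    """Number of components required to reach r_target (Spearman-Brown solved for k)."""
    return r_target * (1 - r_single) / (r_single * (1 - r_target))


# Assumed single-grade reliability of about .40 and a target GPA reliability of .90.
k = components_needed(0.40, 0.90)
print(f"courses needed: {k:.1f}")  # courses needed: 13.5
```

With that assumed input, the requirement comes out at about 13.5 courses, in line with the figure of approximately 14 given in the text.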

Studies attempting to predict weighted GPAs are accepting a 
handicap in the form of the credit-hour weighting system, which 
attenuates the reliable variance in the GPA criterion. This weight- 
ing system is imposed primarily to satisfy bookkeeping require- 
ments (e.g., hours cumulated towards graduation, cost per unit 
class hour) that are peripheral to the problem of student evalua- 
tion. The present findings suggest that unweighted GPAs will yield 
somewhat higher validity coefficients than weighted GPAs. 

According to classical test theory, the square root of the reliabil- 
ity is the upper limit of the validity coefficient, which, in terms of 
the present results, would imply that a substantial portion of re- 
liable variance in GPAs is not predicted. However, conclusions 
based on classical test theory are customarily appropriate for uni- 
variate (factorially pure) measures, but not for factorially com- 
plex measures such as general aptitude and achievement tests and 
GPAs. High reliability does not guarantee validity if a large por-
tion of the reliable variance is unique variance. Furthermore, most
validity studies indicate that combinations of reasonably reliable 
cognitive variables predict college grades nearly as well as do col- 
lege grades themselves. As an illustration of these points, Mehrens 
and Rogers (1970) data yield a correlation of .64 between fresh- 
man and sophomore unweighted GPAs, despite GPA reliabilities 
in the .80s.

Although the application of more powerful multivariate tech- 
niques may enable the investigator to account for certain com- 
ponents of scholastic performance variance more accurately, there 
is probably little reason to expect that the search for new predic- 
tors of the overall performance represented by the GPA will be 


very fruitful. 
THE USE OF PATTERN ANALYSIS FOR THE
PREDICTION OF ACHIEVEMENT CRITERIA 
USING MULTIPLE PERFORMANCE MEASURES 


DAVID FRIEDMAN 


Applied Management Sciences, Inc.
Silver Spring, Maryland 


ADVANCES have been made in recent years in the direction of
improving the methodology of achievement measurement. The pattern
or configural model, designed to treat data so as to yield a
higher degree of predictive ability, has been one of the products of
these efforts. Evidence has been presented that increased accuracy
of prediction may be obtained if the predictor variables are treated
as patterns of scores rather than as linear combinations of averages
of separate independent scores.

If the criterion is a single composite index, the researcher's
problem involves finding that combination of variables which yields
the best prediction of the criterion involved. For most studies,
pattern analysis as applied to item responses within a single test
has yielded no better discriminations than have the more usual
additive techniques which ignore interitem relationships. Many times,
however, one is interested in predicting success on a number of
performance measures at once, and several approaches have been
explored that suggest that prediction of multiple criteria can be
improved by employing pattern scoring of responses as opposed to
conventional methods. This study examined those approaches.

Problem. This study examined models for improving prediction
of single and multiple criteria. The prediction formulation for a
single criterion using simple linear combinations for two variables
could be stated mathematically as $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2$,
where $Y$ is the criterion, $X_1$ and $X_2$ are independent predictor
variables, and $\beta_0$, $\beta_1$, and $\beta_2$ are the linear variate
coefficients. And, similarly, for multiple criteria:
$\lambda_1 Y_1 + \lambda_2 Y_2 = \beta_0 + \beta_1 X_1 + \beta_2 X_2$, where $Y_1$
and $Y_2$ are independent criteria, $X_1$ and $X_2$ are independent predictor
variables, and $\lambda_1$, $\lambda_2$, $\beta_1$, and $\beta_2$ are canonical
variable coefficients. Now, if the derived predictor was a
multiplicative combination of the previous variables, the Y's would
then be predicted by:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$; and
$\lambda_1 Y_1 + \lambda_2 Y_2 = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$,
with the assumption being that the interaction term was
not zero and contributed a significant effect.

Based on these formulations, two predictions were made: (1) 
simple linear combinations of predictor variables would perform as 
well predicting a single criterion as would nonlinear combinations 
of the variables; (2) combinations of linear and nonlinear predic- 
tor variables would yield higher correlation with multiple criteria 
than would either alone. 

Methodology. Six hundred eighty-two eighth-grade students 
were administered two instruments: the Parent-Child Relations 
Questionnaire (PCR) (Roe and Siegelman, 1963) and the Cali- 
fornia Achievement Test (CAT) (Tiegs and Clark, 1957). Scores 
for the students on the 10 scales of the PCR were factor analyzed 
using VARIMAX rotation. The analysis yielded three orthogonal 
factors. The factor loadings were used to create three standardized 
factor scores for each student. These factor scores were computed 
by using a technique similar to that of Guilford and Michael 
(1948). The scores served as the independent variables for the 
study. There were six achievement scores for each student on the 
CAT, and these scores were the criteria for the study. The subject 
pool was divided in half for purposes of cross-validation. 

To determine the accuracy of the prediction about a single cri- 
terion, linear and nonlinear combinations of the PCR factor scores 
were used to predict in turn each of the six CAT scores using mul- 
tiple regression techniques. Following the work of previous efforts, 
the factor scores were also divided at the mean and assigned binary 
scores of +1 or —1. These binary scores were then used in com- 
bination to predict the criteria for purposes of comparison of re- 
sults found using the continuous predictors. The prediction about 
multiple criteria used the same continuous predictors, but were 
analysed using canonical correlation techniques. 
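
The binary scoring step can be illustrated with a short sketch. It assumes that a score at or above the mean is coded +1 and that "in combination" means forming products of the binary codes as configural (pattern) terms; neither detail is specified in the text, and the function names and sample scores are hypothetical.

```python
def binarize_at_mean(scores):
    """Code each score +1 if it is at or above the sample mean, otherwise -1."""
    mean = sum(scores) / len(scores)
    return [1 if s >= mean else -1 for s in scores]


def pattern_term(*binary_predictors):
    """Product of +1/-1 codes across predictors, one value per subject."""
    terms = []
    for codes in zip(*binary_predictors):
        product = 1
        for code in codes:
            product *= code
        terms.append(product)
    return terms


# Hypothetical factor scores for four students on two PCR factors.
factor_1 = [1.0, -2.0, 3.0, -2.0]
factor_2 = [-1.0, -1.0, 4.0, -2.0]
b1 = binarize_at_mean(factor_1)   # [1, -1, 1, -1]
b2 = binarize_at_mean(factor_2)   # [-1, -1, 1, -1]
b12 = pattern_term(b1, b2)        # [-1, 1, 1, 1]
print(b1, b2, b12)
```

A product term of +1 marks subjects whose two codes agree and -1 marks those whose codes differ, which is the binary analogue of the interaction term in the continuous model.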

For single criterion prediction, the contribution of the various
combinations of the predictors toward increasing the correlation
between the independent variables and the criteria was determined
by testing the significance of the multiple $R^2$ when predictors
were added to or deleted from the model. Such a procedure
had no direct analogy when multiple criteria were being predicted.
For the latter case, the number of canonical correlations which
were significant was determined by using various combinations of
the predictors.
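
Testing whether added predictors raise the multiple correlation is conventionally done with an F test on the increment in the squared multiple correlation. The sketch below shows that standard test; it is not taken from the study, and the sample size, predictor counts, and squared correlations in the example are hypothetical values chosen only to show the arithmetic.

```python
def f_for_r2_increment(r2_full, r2_reduced, n, k_full, k_reduced):
    """F ratio for the gain in R-squared when predictors are added.

    n: number of subjects; k_full and k_reduced: predictor counts in the
    full and reduced models. Degrees of freedom are
    (k_full - k_reduced, n - k_full - 1).
    """
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    f_ratio = ((r2_full - r2_reduced) / df_num) / ((1 - r2_full) / df_den)
    return f_ratio, df_num, df_den


# Hypothetical case: 341 subjects, 3 linear predictors versus those 3 plus
# 4 product terms, with R-squared rising from .10 to .11.
f_ratio, df_num, df_den = f_for_r2_increment(0.11, 0.10, n=341, k_full=7, k_reduced=3)
print(f"F({df_num}, {df_den}) = {f_ratio:.3f}")
```

An F ratio near or below 1, as in this made-up case, would indicate that the nonlinear terms add nothing beyond the linear predictors.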

Results. The finding that there were no fewer significant correla- 
tions when the nonlinear predictors were removed from the model 
was considered to be evidence that nonlinear combinations made 
no additional contribution to the prediction model. The weights 
obtained for the predictors in both prediction situations were de- 
termined to be stable during cross-validation. 

The results of the study supported the prediction that simple 
linear combinations of variables would probably be the best pre- 
dictors of a single criterion. Taken together with the results of 
other efforts, it would seem that further attempts at trying to im- 
prove prediction of a single criterion by other than linear models 
are unwarranted. 

For predicting multiple criteria, the results seemed to discon- 
firm the prediction that combinations of linear and nonlinear 
predictors would perform better than either alone. The highest 
correlations were achieved with simple linear combinations. How- 
ever, a factor analysis of the criterion variables (the CAT scores) 
showed that what appeared to be multiple criteria was actually a 
single composite criterion. 

Discussion. Given this finding, it was concluded that the predic- 
tion stated above has been neither confirmed nor disconfirmed. In 
order to test the prediction, one would need to have available a set 
of multi-factor criteria. A suggested combination of these criteria 
might be a set of noncognitive variables, such as the scales from 
the College Student Questionnaire concerned with satisfaction with 
a number of aspects of the college environment. 

Two considerations must be included in a discussion of predicting 
multiple criteria. First, if these criteria are linearly related, it may 
well be that the best prediction of those criteria will arise from a
simple linear combination of the predictors. Second, if the multiple 
criteria exhibit nonlinear relationships, further investigation of 
the prediction model is needed to determine the most appropriate
combination of predictor variables. 
PREDICTING THE COLLEGE SUCCESS OF NON HIGH-
SCHOOL GRADUATES WITH THE TESTS OF GENERAL 
EDUCATIONAL DEVELOPMENT 


AMIEL T. SHARON 
Educational Testing Service 


THE tests of General Educational Development (GED) were
developed in 1942 by the United States Armed Forces Institute
in order to provide the veterans of World War II a means to
readjust to civilian life as they resumed their educational and
vocational plans. The GED tests provide the non high school
graduate an opportunity to obtain a high school equivalency
certificate which is generally accepted as a regular high school
diploma by institutions of higher education, business organizations,
and civil service commissions.

The GED battery consists of five tests in the areas of English, 
Social Studies, Natural Sciences, Literature, and Mathematics. The 
tests are designed to measure knowledge acquired in the typical 
general educational programs offered in secondary schools. Rather 
than emphasizing knowledge of details, the tests concentrate on 
the ability to generalize concepts and ideas, to comprehend exactly, 
and to evaluate critically. The tests also seek to determine the 
extent to which informal educational experiences have had a long- 
term impact equivalent to that which might be the result of a 
good formal education. Thus, by means of these tests, individuals 
who have not formally completed their secondary school education 
may be certified as having the equivalent of a high school diploma. 

Previous investigations of the GED had a number of limitations 
which made it difficult to evaluate the tests’ validity for admis- 
sion of non high school graduates to higher education. The 
studies were invariably conducted within single institutions—a 
circumstance thus limiting the number of subjects that could
participate in any given study. Furthermore, because of institu- 
tional diversity in populations, admission standards, and grading 
practices, it was difficult to generalize across institutions and to 
compare the results of these studies. A third reason for the need 
to reassess the validity of the GED battery is that there has 
been a shift in the GED examinee population since the original 
validity studies were conducted, as more civilians than veterans 
are tested with the battery today. Thus, the major objective 
of this study was to determine the validity of the GED battery 
as an instrument of admission of non high school graduates to a 
variety of institutions of higher education. 

Method. Forty colleges and universities (12 junior, 28 senior) 
provided GED test scores and cumulative grade-point averages 
(which were generally based on a two-year period) for all their 
currently-enrolled students who have been admitted on the basis 
of the GED tests or a high school equivalency certificate. The 
institutions were geographically diverse, not highly selective 
(only one college accepted less than half of its applicants), and 
90 per cent were under public control. Several colleges did not 
provide complete information on all their students. GED scores 
could not be determined for 159 of the 1,367 students who had 
been identified at the participating institutions. GPAs were not 
provided for most of the 390 students who had withdrawn from 
college. 

Each student in the sample was mailed a questionnaire
requesting various biographic and demographic information,
including information on experiences with the GED, and attitudes
toward a variety of current social issues. Follow-up postcards
requesting return of the questionnaires were sent to most of the
nonrespondents. Returns were received from 538 students or 39
per cent of the total sample.

Based on the questionnaire responses a profile of the non high 
school graduate in college was developed. The average subject 
was a 28-year-old male veteran who learned about the GED pro-
gram in the armed services. He took the tests in order to be able
to enroll in a college. He was admitted to a college without any 
restrictions and despite his relatively old age he had little or no 
problem in adjusting to college. His attitudes toward certain 
academic and social issues were more conservative than those of the
general college student population. His formal schooling consisted 
of the completion of tenth grade and subsequent withdrawal from 
high school because of the need to earn money. His nontraditional 
or informal education consisted primarily of independent study 
in technical and job-related subjects. He planned to obtain a 
bachelor’s degree and to engage in a business career. 

Results—Total sample. The central prediction system (Tucker, 
1963) was used to pool the data across the participating colleges 
and to compute average validities (i.e. GED correlations with 
GPA). This procedure adjusts for the different grading standards 
at each college and produces relatively stable overall validities. 
These validities were weighted by the number of cases at each 
college. 
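
The final weighting step can be sketched as a case-weighted mean of college-level correlations. This illustrates only the weighting by number of cases; it is not an implementation of the central prediction system itself, and the per-college values below are hypothetical.

```python
def weighted_mean_validity(college_results):
    """Average of correlations weighted by number of cases.

    college_results: list of (n_cases, r) pairs, one per college.
    """
    total_cases = sum(n for n, _ in college_results)
    return sum(n * r for n, r in college_results) / total_cases


# Hypothetical validities for two colleges of different sizes.
example = [(50, 0.30), (150, 0.40)]
pooled = weighted_mean_validity(example)
print(f"weighted mean validity = {pooled:.3f}")  # weighted mean validity = 0.375
```

Weighting by cases lets larger colleges, whose correlations are more stable, contribute more to the overall estimate than small ones.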

The validities found for the five GED tests were as follows:
English .31, Social Studies .35, Natural Sciences .32, Literature
.36, and Mathematics .31. The validity coefficients, all of which
were in the .30s, were all significant at the .01 level. Correlations
of this magnitude generally indicate that the test can be
appropriately used for prediction of college success. Moreover, these
coefficients are similar to those obtained with other college
admissions tests, such as the Scholastic Aptitude Test, when they are
used for predicting the success of traditional college students.

Type of institution. GED validity coefficients computed sepa- 
rately for GED students enrolled in two- and four-year institutions 
are indicated in Table 1. In the case of every test, the validity was 
higher for junior than for senior colleges. Furthermore, the pattern 
of validities was different at the two types of institutions. Social 
Studies was the best predictor in the two-year colleges whereas 


TABLE 1
Correlation of GED Tests with GPA by Type of Institution

                        Two-Year              Four-Year
                        (N = 211)             (N = 594)
Test                     r        p            r        p
English                 .33      .01          .30      .01
Social Studies      [illegible]  .01          .25      .01
Natural Sciences        .43      .01          .25      .01
Literature              .40      .01          .34      .01
Mathematics             .39      .01      [illegible]  .01



Literature was the most predictive of success in the four-year 
colleges. Because the two-year institution validities were higher 
than those obtained for the total sample, type of institution 
could be considered a moderator of the relationship between the
GED and GPA. That is, when making predictions of the likely 
success of the GED student, the type of institution which he 
is attending should be taken into account in the prediction equa- 
tion. 

One explanation for the more accurate prediction of the success 
of nontraditional students at two-year colleges is that the GPA’s 
were generally based on a smaller number of semesters at these 
colleges than at the four-year colleges. If this interpretation is 
correct, then the GED provides a better measure of abilities nec- 
essary for success in initial college courses than for higher level 
courses. 

Age. Since maturity and motivation may play a more important 
role in the college achievement of the older candidate, it was 
hypothesized that subgrouping on the basis of age might raise the 
predictive accuracy of the GED tests. Table 2 indicates that age 
was an effective moderator in the prediction of GPA. The validity 
coefficients for both age groups were all higher than the correspond- 
ing coefficients for the total sample. A comparison of the validities 
of the GED tests of those under age 30 and those age 30 and over 
indicates that there was little difference in the predictability 
of these two age groups. This result was unexpected as it had 
been assumed that the tests would be less valid for the older than 
for the younger candidates. Since motivation had been assumed to 
carry greater weight in the college performance of adults, it was 
hypothesized that ability or previous achievement as measured 
by the GED tests would not be an accurate predictor of success. 


TABLE 2
Validities of GED Tests for Two Age Groups

                      Under Age 30          Age 30 and Over
Test                   N     r     p          N     r     p
English               190   .48   .01        153   .51   .01
Social Studies        190   .36   .01        152   .42   .01
Natural Sciences      190   .35   .01        151   .42   .01
Literature            190   .49   .01        158   .43   .01
Mathematics           190   .48   .01        151   .52   .01
Since the hypothesis was not tenable, the findings were encourag-
ing for the use of the GED tests with older candidates.

Prediction of withdrawal. The prediction of who will drop out 
of college is a formidable task. Previous research has failed to 
isolate any ability or background factors which can be used to 
predict withdrawal from college with more than minimal accuracy. 
The prediction problem exists because students leave college for 
a number of different reasons, only one of which is academic 
failure. A further complication is that there is little agreement 
on the definition of a dropout. Some students transfer to other 
colleges while others return to college after being away for a 
semester or more. In the present study, all GED students who 
were not enrolled at the time that the colleges provided GPAs 
were considered to have withdrawn. The 390 withdrawals ac-
counted for 28% of the total sample.

The relationship between the GED scores and attrition is in- 
dicated by the following correlations of GED tests with with- 
drawal or nonwithdrawal from college: English .20, Social Studies 
.19, Natural Sciences .17, Literature .19, and Mathematics .14.
Although all correlations were significant, they were quite low. 
Thus, while the GED tests can predict attrition above the chance 
level, many erroneous predictions of withdrawal can result. Never- 
theless, even the modest relationship found between the GED scores 
and persistence in college may be as useful as any measure 
which is currently available in predicting attrition. 

Conclusion. The results of this study suggest that the GED tests 
are useful and appropriate for the admission and guidance of non 
high school graduates to higher education. Subgrouping students 
on the basis of moderator variables such as type of college and 
age can raise the predictive accuracy of the tests. The validity of
the tests for predicting withdrawal from college is significantly 
positive but low.
THE FACTORIAL DIMENSIONS OF THE COMPARATIVE
FRESHMAN CURRICULA GROUPS


KENNETH W. ELTERICH, JR. AND ROBERT K. GABLE
THE Comparative Guidance and Placement Program (CGP)
(College Entrance Examination Board, 1971) is a series of
tests, questionnaires, and related services designed to meet the
guidance and placement needs of the two-year college and of the
students entering those colleges. Developed by the College
Entrance Examination Board (CEEB), the battery includes 11
interest measures and seven aptitude measures.

The aptitude measures include: Reading, Verbal, Sentences, 
Mathematics, Year 2000 (requiring integrative reasoning), Mosaic 
Comparisons (reflecting perceptual speed and accuracy), and 
Letter Groups (necessitating inductive reasoning in a nonverbal 
context). The interest measures were designed to represent activities 
in Mathematics, Physical Science, Biology, Health, Home Eco- 
nomics, Business, Secretarial, Social Sciences, Fine Arts, Music, 
and Engineering Technology. The interest scores are summarized 
and reported on a scale of 0 (all dislikes) to 32 (all likes); 
a score of 16 indicates an equal number of likes and dislikes.

Purpose. The present study was concerned with examining the 
construct validity of the 18 aptitude and interest measures of the 
CGP for three diverse college program groups. Lunneborg, Green- 
mun, and Lunneborg (1970) reported that a factor analysis of 
the 1967 version of the CGP presented a meaningful factor struc- 
ture for 687 community college entrants. Of the six factors pre- 
sented, four were defined by the interest measures; two by the 
aptitude tests. Grimaldi, Loveless, Hennessy, and Prior (1971) 
declared that a principal components analysis with a VARIMAX 
rotation of the 1970-71 version of the CGP yielded six interpretable 
factors for a total freshman class. Two of the factors 
were related to the cognitive tests, and four were related to the inter- 
est measures. However, no factor analysis of the CGP has been 
reported for separate curriculum groups within a community college 
sample. Such investigations are necessary for the use of the in- 
strument with students pursuing different programs. 

The primary purpose of this study was to examine the battery’s 
factorial dimensions for each of three student groups with different 
program goals. The outcomes of the study were of primary con- 
cern for admissions counseling, as it was an attempt to learn 
more about the students seeking and entering the various courses 
of study. This information may help the counselor and student 
make decisions regarding appropriate programs to pursue. 

A secondary purpose was to examine the construct validity of 
the interest profiles for the three programs involved. The hy- 
pothesis that similarities and differences in the CGP profiles 
could be related to particular programs of study was examined. 

Procedures. The CGP was administered to 350 incoming freshman 
students at Manchester Community College in three curricula 
areas: Transfer programs, Business-Technical Career programs, 
and General Studies. 

The first area, Transfer (N = 116), was composed of transfer 
students from the Liberal Arts, and Business Administration 
Transfer divisions. All students tested in this area showed a 
preference for continuing their education beyond the two-year 
level. The second area, Business-Technical Career programs (N 
= 141), consisted of those students interested in a two year pro- 
gram. Curricula included were: Business Administration Career, 
Data Processing, Hotel-Restaurant Management, and the Secretarial 
programs. The third area, General Studies (N = 98), 
consisted of those students who had little vocational or educational 
“knowledge.” They were unsure of the type of career to pursue or 
of how much education they desired. Many of these students 
tended later to transfer from this program into a career oriented 
field or into Liberal Arts. 

Data obtained from the three samples were submitted separately 
to image analysis and principal components analysis using both 
orthogonal (VARIMAX) and oblique (OBLIQUIMAX) transformations 
(Hofmann, 1970) in order to obtain the most meaningful 
descriptions of the variable interrelationships for these 
particular samples. The image analysis and the OBLIQUIMAX 
transformation provided the most meaningful results for all three 
samples. 

Results and discussions: Business sample. Table 1 contains 
the selected entries from the primary pattern matrix for all three 
college groups. For the Business Careers sample ten dimensions 
were generated; six were found to be meaningful. 

Factor I was called a General Aptitude dimension as it was 
defined by all the aptitude tests. Individuals scoring highly on 
Factor I exhibited scholastic ability as measured by these tests. 
Factor II was called Health-Creativity; high scores on this 
factor indicated an interest in the Health-Biology field as well 
as active participation in discovery. Factor III was labeled 
Business Interest. High scores reflected an interest in the business 
world and in the practical applications of business. Factor IV, 
which was defined by Social Science, suggested an interest in 
history and current events. Factor V, Scientific-Technology, was 
associated with high interest in scientific technology, both in the 
laboratory and in the field. Factor VI was described as a 
Mathematics interest and ability factor. 

The intercorrelations of the primary axes indicated that the 
six dimensions were relatively independent. The highest 
correlation was found between Factor V, Science, and Factor II, 
Health-Creativity (r = .26); both of these factors emphasized 
creativity and discovery. 

Transfer sample. The factor analysis of the Transfer sample 
data resulted in 10 dimensions, six of which were meaningful 
(see Table 1). The dimensions for this sample seemed to be 
identified by the interest areas contained in the curricula of the 
transfer programs of study. 

Factor I was identified as a Verbal Aptitude factor, as it 
reflected verbal content and reasoning rather than general 
aptitude as in the Business Career sample. Factor II was called 
Biology Interest, as it was defined by scales from the science 
fields and as it was most highly loaded on the Biology scale. 
Factor III was described as Business Interest and Factor IV 
as Fine Arts; both were defined by those particular interest scales. 
Factor V, Inductive-reasoning and Perceptual Speed, suggested 
that people with high ability in identifying symbolic systems 
and possessing perceptual skills have a low interest in such things 
as historical events. Scoring highly on Factor VI, Scientific- 
Technology, indicated interest in scientific discovery and 
mathematical problem-solving activities. 

The intercorrelations of the primary axes for the Transfer 
sample revealed positive relationships between those factors 
emphasizing scientific discovery in biological science activities and 
creativity in fine arts (Factor II and Factor IV, r = 440), and 
between scientific discovery and mathematical applications (Factor 
II and Factor VI, r = .35). 

General studies sample. For the General Studies sample 10 
dimensions were identified, seven of which were meaningful. The 
primary pattern matrix for this sample and that of the Transfer 
sample were very similar—an outcome which may reflect the 
similarity of course offerings in the two areas. 

Factor I was named Verbal Aptitude, as it was composed of 
the scales consisting of verbal material and verbal reasoning. 
Factor II, Biology Interest, Factor IV, Business Interest, Factor 
V, Scientific-Technology, and Factor VI, Fine Arts, were similar 
to Factors II, III, VI and IV, respectively, in the Transfer 
sample both in scale content and in comparative sizes of loadings. 
Thus they were given the same factor names. These dimensions 
may be termed “comparable common factors” (Harris and Harris, 
1971). Factor III was called Social Science-Business Interest. 
People scoring highly on this factor seemed to have an interest 
in current events, politics, public relations, and business. Factor 
VII was called Non-Verbal Aptitude, as it was defined by the 
scales reflecting an aptitude in a non-verbal context. Mathematics 
was included in this factor, but it did not contribute greatly 
to naming the dimension. 

The intercorrelations of the primary axes for the General Studies 
sample indicated that some of the seven dimensions were related 
in expected directions (Factor II, Biology, and Factor V, 
Scientific-Technology, r = .38; Factor II, Biology, and Factor VI, 
Fine Arts, r = .29). A comparison of the factor intercorrelations 
for the General Studies sample with those from the other two 
samples revealed that, when factors with similar names were 
compared, a definite similarity in the factor relationships was 
present. 

Summary. It appears that meaningful dimensions were measured 
by the Comparative Guidance and Placement instrument for all 
three college program groups. For the Transfer and General 
Studies samples similar factors were identified that contained 
interest areas associated with those curricula. A common Verbal 
factor was also isolated. As compared with the other two 
samples, the Business sample combined more of the CGP scales 
into single factors though several interest areas similar to those 
of the other two samples were identified, along with a strong 
Business Interest factor. 

Interest profiles. A secondary purpose of this study was to ex- 
amine the construct validity of the interest profiles for the three 
different program groups. Mean scores of the 12 CGP interest 
scales were plotted for the three groups along with national 
norms from the 1970-1971 CGP Total Program Statistics. In- 
spection of the profiles, which are available from the author, 
suggests the following conclusions: First, the patterns of interests of 
students in this study were very similar to the National Norms 
group; second, a priori hypotheses of curricula similarities and 
differences in the interest profiles were substantiated. 

As suggested by the factor analysis, the interest profiles for 
the General Studies and Transfer groups were undifferentiated— 
a finding characterizing their indecision about a major field of 
study as well as the variety of major fields included in these 
categories. In contrast to the Transfer and General Studies groups, 
the profile of the Business students indicated a stronger interest 
in the Secretarial and Business areas than in the other areas. The 
identified similarities and differences in the interest profiles were 
conceptually consistent. It appears that the CGP can identify in- 
terests of students who have elected to pursue different college 
programs. 
USE OF THE D-48 TEST, OTIS QUICK SCORE MENTAL 
ABILITY TEST, AND NATIONAL TEACHER EXAMINATIONS 
FOR PREDICTING SUCCESS IN A GRADUATE 
SCHOOL OF EDUCATION 


BRAD S. CHISSOM, JERRY B. THOMAS, AND RALPH LIGHTSEY 
Georgia Southern College 


PREDICTION of college success using the D-48 as a predictor has 
been investigated for undergraduate populations in the United 
States by Boyd and Ward (1967), Cantwell (1966), and 
Domino (1964). A study by Rafi (1967) used Lebanese college 
males as subjects when predicting college grades with the D-48 
Test. 

The D-48 Test, which is composed of 44 items using dominoes 
as the item format, is described as a nonverbal analogies test 
of general intelligence (Black, 1961). Results have shown only 
moderate relationship between the D-48 Test and college achieve- 
ment with the obtained correlations ranging from .20 to .37 for 
the studies cited. However, the test has shown promise as a 
measure which might make a significant contribution to a pre- 
diction equation when used in conjunction with other measures. 

Purpose and method. The specific objectives of the study were 
to: (a) assess the relationships between the D-48 Test, the Otis 
Quick Scoring Mental Ability Test, Gamma Form AM (OQS), 
and the National Teacher Examinations—Common Examinations 
(NTE) for graduate students in education, and (b) determine 
whether the D-48 Test could be used in conjunction with 
other measures to predict the grade point average for graduate 
students enrolled in masters’ degree programs in the School of 
Education of Georgia Southern College. The group was comprised 
of 68 male and 74 female students with an average age of 29.65 
years. The D-48 Test, OQS, and NTE were administered to all 
subjects upon their entrance into a graduate degree program. The 
criterion measure used in the study was the cumulative grade 
point average (GPA) based on a 4-point scale with a minimum 
of 30 quarter hours of graduate credit. 

Reliabilities for both the D-48 and the OQS were computed by 
the odd-even, split-half method increased by the Spearman-Brown 
formula, and a Kuder-Richardson (KR) reliability estimate was computed for the 
D-48. Data were submitted to a step-wise multiple regression 
analysis which includes variables in the regression equation in 
the order of their contribution to the overall relationship. In- 
clusion of sex (M = 1, F = 0) and age (in years) as variates in 
the analysis resulted in a total of five predictor variables. 
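
The split-half step described above can be sketched in a few lines. This is only an illustration of the standard Spearman-Brown correction; the function name `spearman_brown` and the half-test correlation used in the demonstration are assumptions made for the example and are not values reported in the study.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Step up a part-test correlation to the reliability of a longer test.

    r_half:        correlation between the two halves (e.g., odd vs. even items).
    length_factor: how many times longer the full test is than each part
                   (2.0 for the ordinary split-half case).
    """
    if not -1.0 < r_half <= 1.0:
        raise ValueError("r_half must lie in the interval (-1, 1]")
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


if __name__ == "__main__":
    # Hypothetical odd-even correlation, chosen only for illustration.
    print(round(spearman_brown(0.72), 3))
```

The correction always raises a positive half-test correlation, which is why a split-half coefficient is reported only after it has been stepped up to full test length.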

Results and Discussion. Reliability estimates for the predictor 
variables are contained in Table 1 along with the means, standard 
deviations, and intercorrelations of all the variables. The re- 
liability estimates for the D-48 were consistent with prior re- 
search using adult subjects (Black, 1961, and Boyd and Ward, 
1967). The split-half reliability calculated for the OQS was .84. 
Reliability for the NTE was reported by Cook (1961). 

An examination of the means for the males and the females for 
all predictor variables revealed no significant differences between 
the two groups. Correlation between the D-48 and the OQS was 
.44, a value less than the .57 obtained by Boyd and Ward (1967), 
and greater than the .27 reported by Chissom and Lightsey 
(1971) in studies using the two tests with adult subjects. The 
OQS and NTE shared a greater amount of variance than any 
other pair of measures. However, the D-48 did not correlate 
with either the NTE or the criterion measure to any great extent. 
The best predictors of GPA appeared to be the OQS and the 
NTE, and since they shared a sizable amount of variance it 
was likely that they were accounting for the same part of the 
variance in the criterion. 


TABLE 1 

Intercorrelations, Means, Standard Deviations, and Reliabilities for Variables 

Variables               1      2      3            4       5       6         M       SD 
1. Age                 --   -.07   -.13         -.15     .02    -.08      29.65     7.91 
2. Sex                        --    .03          .05     .00    -.24**      .48      .50 
3. D-48                              --          .44**   .16*    .10      26.99     5.54 
4. OQS                                            --     .57**   .27**    56.94     9.11 
5. NTE                                                    --     .26**   618.32    59.68 
6. GPA                                                            --       3.54      .30 
Reliabilities 
 (Split-Half)                   [illegible]      .84 
Reliabilities (KR)                  .88          .88a    .96b    -- 


* Significant at .05 level; required r = .16, df = 150. 
** Significant at .01 level; required r = .21, df = 150. 
a Otis (1954). 
b Cook (1961). 

Results of the step-wise regression analysis are presented in 
Table 2. The variables are numbered in the order in which they 
were entered into the prediction equation. The total multiple 
correlation, including four of the predictors, was .39, which was 
statistically significant beyond the .01 level. Further examination 
of the results indicates that the OQS and sex of the individual 
did predict GPA more effectively than did the NTE score alone. 
In addition, the OQS requires only 30 minutes to administer 
compared to over three hours for the NTE. These factors seem 
to indicate that the OQS could be effectively substituted for the 
NTE as a screening measure for graduate students in education. 
In view of the fact that there were no significant differences 
between males and females in the means of the four predictors, the 
significant correlation between sex and GPA would seem to in- 
dicate some advantage for females over males in obtaining grades 
in graduate school. 

The D-48 was not added to the prediction equation, since it 
made a negligible contribution to the total prediction. It appears 
that the non-verbal nature of the D-48 did not make it a valid 
test for predicting graduate GPA, either by itself, or in 
conjunction with other measures. 


TABLE 2 
Summary Table for Step-Wise Multiple Regression 

Variable      Multiple               Increase 
Entered          R          R²        in R² 
1. OQS         .268        .072       .072 
2. Sex         .366        .134       .062 
3. NTE         .384        .147       .013 
4. Age         .391*       .153       .006 


* F = 6.17 (df = 4, 137), p < .01. 
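
The arithmetic behind Table 2 can be checked with a short sketch. The helper names below are illustrative, and the sample size of 142 is an assumption based on the 68 men and 74 women described earlier, which agrees with the 137 error degrees of freedom. Because the printed R² is rounded to three places, the F ratio computed this way comes out near 6.19 rather than the reported 6.17; the difference is presumably rounding.

```python
def overall_f(r_squared, n_cases, n_predictors):
    """Return (F, df1, df2) for testing a multiple R-squared against zero."""
    df1 = n_predictors
    df2 = n_cases - n_predictors - 1
    f_ratio = (r_squared / df1) / ((1.0 - r_squared) / df2)
    return f_ratio, df1, df2


def r_squared_increments(cumulative_r_squared):
    """Gain in R-squared contributed by each successive step."""
    gains = []
    previous = 0.0
    for value in cumulative_r_squared:
        gains.append(round(value - previous, 3))
        previous = value
    return gains


if __name__ == "__main__":
    # Cumulative R-squared values as printed in Table 2 (OQS, Sex, NTE, Age).
    table_2 = [0.072, 0.134, 0.147, 0.153]
    print(r_squared_increments(table_2))
    # N = 142 assumes all 68 men and 74 women entered the analysis.
    print(overall_f(table_2[-1], n_cases=142, n_predictors=4))
```

The increments show why the equation is dominated by its first two steps: the OQS and sex together account for .134 of the .153 total.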
CORRELATES OF SUCCESS IN PRACTICE TEACHING 
IN MUSIC AT THE UNIVERSITY OF SOUTHERN 
CALIFORNIA 


CAROLE S. CHADWICK 
California State University, Fullerton 
WILLIAM B. MICHAEL AND JAMES HANSHUMAKER 
University of Southern California 


FOR a total group of 112 students (58 women and 54 men) at 
the University of Southern California (USC) who had started 
their student teaching assignment in music between 1960 and 
1970, the major objective of the investigation was to determine 
the criterion-related validity of each of 31 predictors (32 
predictors in the instance of the total sample) in relation to each 
of four success measures derived from observations (ratings) of 
Personal Qualities (six items) and Professional Competence (10 
items) from the Rating Scale for Directed Teaching (RSDT), a 
form employed by the USC School of Education. The 32 predictor 
variables, which are cited in Table 1, furnished measures that 
were categorized as belonging to five classifications: (a) per- 
formance and ability in achievement tests (Variables 10 through 
22), (b) academic attainment reflected by grade-point averages 
(GPA’s) in logically related combinations of courses in music 
(Variables 23 to 28, 30, and 31) as well as by overall grade-point 
average (GPA) in all college courses (Variable 29), (c) amount 
of pre-college participation in different kinds of music and 
music-related activities (Variables 5, 6, 7, and 32) and extent of 
participation in music and music-related activities in college as 
reflected by the promise which a student showed in music when 
he was evaluated by his instructors on the Reference Form for 
Directed Teaching (RFDT), a recommendation sheet used by the 


1073 


1074 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


70— 


10— 


@%) 


тс 25 та 60 55 05 20 9r те Ро St 
6т ст IG 8I 65 т л 05 75 TI SI 
20— 0= И [43 88 75 TI и л 80 10 
80 05 6т эт $8 1% 4 т 9r 15 75 
то 7I ST 00 и 50— 0— 00 so 10 20 
то 40 30 10— от 95 £6 40 60 с от 
y0— 10 £0 T0— OI О 10— 50 90 T0— $80— 
95 LE 80 90 05 л 55 ет 05 9r т 
40 05 та 90— 80 20 от 90 90 ст л 
99 OF £y 19 79 [44 78 [2 73 [44 cy 
y0— 21— 60— £1— —6I— 05 1— 905 202 Ll А; 1 болт 
15 65 $5 9r gt €0 8T £I та 81 T6 


Зптаоввәуү-отуәшчуүшгү 

— уво, упошелотцо зтазолтео) 
човчәцәлйшогу 

-Surpuvo1— sso, упәшәләтцәү тталодүвгу 
Ár6[nquo90A. 

-Зитрео}1—5}5Э1, упәшәләгцәү ®талодүвгу 
Speguourepung 

ЧЕН я-—991, uoreogrsse[)) qsr2us OSA 
soeur 

—əmymo peun әлтувдәйоогу 

ерү eurjg—ornj[n[) үвләчәгу әлтүвдләйоогу 
aouaIog—ainyng [вләпәгу әлт}вдәйоогу 
әлпүвдәўүт—эәлиїп{у eruan 9418194000) 
зәре 

[BHog—emyny үвдәпәгу әлтувдәйоогу 


Buryovay, рәјә 10] шло] ооподојо * 
Surgovay, yuepnyg jo Витишдо 3% 03V + 


вида шоу 


БО—вәтАрәү pojspr-omn]y 19430 ` 


sndurep u—senranoy 


peyeper-orenyy qooqog үг 19930 ° 


в@полгу әәпвшцлоуләд 


әйтет—ѕәттлцоү отт} үооцов yar + 


"8l 


үл! 
"9r 
“ST 


PI 
"eI 
"er 
т 


"Or 


5 dw = се 


(9 (о (D Q (9 @ (0) я (9 (0 (0 


по пошом epdureg [930], 


($) 03 (Т) semsvejq чомезыо) 


водзиед Jojorperq 


(ponnug spog тилот) 


мәү 79 рио иэшо М 89 fo szjdumsqng ony, pup 
(ETI = N) Adung 10101 24) 40} (LASA) Визуото popoq 40} poyg Burma oy; шог 
3941802 ур wot4214/) «под fo чор 03 2000 8919017 A 10]02po4qp ONT- Jo ою 40} „вито Прут A. porpjoop-uot25147) 


I ЧОПЧУ 


1075 


ате 
90 
95 
Tr 
88 


££ 
0g 


CAROLE 5. CHADWICK, ET AL. 
аз 


әлә] 90° 48 эопзоуга8!е 10; pe1mbez exe “Ајелтовавох 'gz* рит ‘957 ‘GT jo 5упетотрооо шош jo ejduresqns pus *uouroA jo ојаигввапв 'edures [930 eq? JOT v 


Эт 
80 
9 
ст 


90— 
6t 


90 
Ir 
T6 


152 


18 
152 
[44 
eg 
[44 
60 


38 
ет 


те 


10— 
УТ 


98 


13 
88 
oF 
ТУ 
53 
9r 


18 
95 


[52 


61 
£c 


м 


сс 
eI— 


вт 
БИ 


1— 
OF 


9r 
20 


С^ бы 
80— 


[41 
70 
ет 


LT 
10 


то 


9r 


л 
6т 


At 
д 


xeg 98 
(souesqy 
10 eouose1q) э8эЦоо әлојәд Apnyg ouviq ‘98 
Surqoso, эчәрпї@ 
ur srseqdurg үвлоцгу "8A [eyueumajsu] "pg 
(queumajsug 19430 10 *preoqAe ‘эотюл) 
89100 910394 учәштизвиү [edtoutiq “Eg 
эвэцор e10joq учәштплїүвиү 
Tsdiouniq uo Apnyg OBANA JO S1voX “ZE 
Jooyog &repuooog oy} ur NSN :1eururog ‘Tg 
Tooqog &rejuourepg oy} ut INW :ieurueg "OF 
вэвтор 98000 пу—уао '6c 
se&mor) SPON orm] N—vd? ‘80 
Sesmor) uorjeonp;q [euorssojo1q—vqt) ‘LG 
вәвлпогу 3urjonpuop—vdt) “9% 
вәвтпогу Алоэчт, orenyy—vd? ‘5 
ушәштиуѕит 
тефочиа чо Apnig әуѕзлиа—ү40 "yc 
səsmop эпбтачеэт-—уаю "85 
HOL А 
— то, 3чәшәләгдәоү тотоу) 06 
15101, 
-вотрво — 83591, јпәшәләщцоу зтилорпјво ‘TG 
тој — олуја [e19u0r) 918194007) '05 
s[ejueurepun T-orjoungjtry- 
— 51521, }пәшәләгүәү BIUIOJI[EO ‘61 


penunuoQ—[ әче, 


1076 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


School of Music (Variable 9), (d) presence or absence of study 
of selected instruments (Variables 33 and 35) as well as declared 
emphasis in music teaching (choral vs. instrumental) (Variable 
34), and (e) demographic characteristics of age (Variable 8), 
and, in the instance of the total sample, sex (Variable 36). The 
four criterion measures from the RSDT provided an evaluation 
of observed behaviors in the form of ratings on the following four 
variables which are simply enumerated in Table 1: (1) Elementary 
Student Teaching—Personal Qualities, (2) Elementary Student 
Teaching—Professional Competence, (3) Secondary Student 
Teaching—Personal Qualities, and (4) Secondary Student Teach- 
ing—Professional Competence. 

Methodology. All variables were appropriately quantified. With 
the exception of Variable 33 for which the metric was not mean- 
ingful, intercorrelations of the variables yielded interpretable 
information. Only the criterion-related validities of each of the 
predictor variables relative to each of the four RSDT measures 
for the total sample and for the two subsamples of women and 
men are cited in Table 1. Although brief mention will be made of 
the results from stepwise multiple regression analyses, data will 
not be reported. Intraclass correlation coefficients furnished 
estimates of reliability ranging between .68 and .91 for the four 
criterion measures. 

Findings. The following findings derived from Table 1 may be 
summarized: 

1. For the total sample Variables 9, 23, 25, 26, 28, 29, 30, and 31 
showed statistically significant validity coefficients with every 
one of the four criterion measures. The last seven of the eight 
variables represented direct evidence of level of achievement 
in college work, whereas Variable 9 indirectly indicated 
promise in teaching music, as based upon the ratings of in- 
structors who had had these students in their courses. 
2. For the total sample as well as for the subsamples of women and 
men, the remaining predictor variables including the standardized 
ability and achievement tests registered few if any 
statistically significant validity coefficients. 
3. Although for the two subsamples, the predictor variables (9) 
and (30) yielded comparable validity coefficients, the six 
variables (23), (25), (26), (28), (29), and (31) tended to 
show relatively higher validity coefficients for men than for 
women with respect to the two criterion measures in the 
Elementary Student Teaching portion of the RSDT, but 
relatively higher validity coefficients for women than for men 
for the two criterion measures in the Secondary Student 
Teaching part of the RSDT. Although sex (Variable 36) was not 
significantly related to any one of the four criterion measures, 
there did appear to be an interaction of sex with evaluations 
of performance at the elementary or secondary levels of practice 
teaching as evaluated on the RSDT. 
4. As might be expected, these same predictor variables which 
showed statistically significant validity coefficients were the ones 
chosen by stepwise multiple regression analyses, in which 
resulting composites of two to six predictor variables customarily 
added from .01 to .06 in the proportion of variance accounted 
for in each of the criterion measures over that afforded by the 
single most highly correlated predictor variable. 

Conclusions. The following conclusions evolved from the 
findings: 


1. The RFDT, GPA in several logically related groupings of 
music courses representing a broad range of activities, and 
overall GPA in college courses were the most valid predictors 
of success in the practice teaching assignment. Some degree 
of criterion contamination was probably present, however, 
especially in relation to Variable 30, and to some extent Variable 
31, as one or more of the individuals responsible for these 
courses were almost simultaneously also involved in making 
the observations on the RSDT. 
2. In general, performance on standardized ability and 
achievement tests, participation in pre-college music and 
music-related activities, years of private study on a principal 
instrument, identification with a principal instrument prior to 
college, past study on the piano (presence or absence), age, 
and sex were not significantly related to success in the 
practice teaching assignment. 
3. Composites of optimally weighted predictor variables were 
significantly, though only slightly, more prognostic of success 
in student teaching than were single most valid predictor 
variables. 
4. Level of attainment in six logically related sequences of 
courses was substantially more predictive of practice teaching 
success in elementary school music for men than for women, 
but considerably more predictive of teaching success in secondary 
school music for women than for men. 
RELATIONSHIPS BETWEEN THE CALIFORNIA TEST 
OF MENTAL MATURITY AND THE STANFORD 
ACHIEVEMENT TEST BATTERY 
AT THE PRIMARY LEVELS 


PETER F. MERENDA 
University of Rhode Island 


HARRY S. NOVACK 
AND 
ELISA BONAVENTURA 
Rhode Island College 


THE California Test of Mental Maturity (CTMM) and the 
Stanford Achievement Test (SAT) are among the scholastic aptitude 
and scholastic achievement tests most commonly used in 
American schools. While both purport to be measuring different 
aspects of human potential, there is the inevitable overlap in the 
measurement of these attributes that is associated with the nature 
and content of the item structure in both tests. This study was 
conducted to investigate the nature and degree of this overlap. 

Method. The sample consisted of 279 first grade and 279 second 
grade pupils drawn from nine schools in three school districts 
in Rhode Island. In the first grade sample, there were 163 males 
and 116 females. The second grade sample was split into 138 
males and 141 females. Form S of the CTMM level 1 was 
administered at both grade levels. Form W, Primary I of the 
SAT was administered in grade 1, and Form W, Primary II, 
was administered in grade 2. 

Canonical correlational analysis was performed between the two 
sets at each grade level. The objective of the canonical analysis 
was to determine the degree of overlap between the CTMM and 
SAT, and at the same time to determine the nature of the 
relationships between the two sets. 

Results and Discussion—Grade 1. The intercorrelation matrix 
and distribution statistics for the two tests are reported in Table 
1. Both sets of test scores are described in terms of stanines. 
These data reveal that the grade 1 sample was well above 
average in terms of standard scores as compared to the normative 
samples on the attributes measured by the CTMM, and was quite 
variable in the achievement areas measured by the SAT. The 
correlations reported in Table 1 showed that while the SAT was, 
as expected, substantially more homogeneous than the CTMM, 
the subtests of the CTMM yielded higher intercorrelations 
(r's range from .32 to .55 with median = .41) than would 
normally be expected of a multifactor scholastic aptitude test. 
This result was consistent with a number of findings by the 
authors with the CTMM given to children of this age group. It was 
likely due to the fact that differential abilities are somewhat 
undifferentiated at this low age level. It is significant, however, 
to note that the intercorrelations between the CTMM and the 
SAT were nearly as high (r's ranging from .14 to .60, with 
median = .38) as were the intracorrelations for the CTMM. 
Canonical correlational analysis (Hotelling, 1935) was applied 
to the data of Table 1. The results are summarized in Table 2. 
The data of this table reveal that the first two canonical variates 
were statistically significant (Tatsuoka, 1971). The first canonical 


TABLE 1 
Intercorrelations and Distribution Statistics, CTMM and SAT, Grade 1 (N = 279) 

Tests                      1     2     3     4     Mean    SD 
(CTMM) 
 1. Logical Reasoning                              6.67   1.52 
 2. Numerical Reason.     .55                      5.67   1.38 
 3. Verbal Concepts       .51   .41                6.37   1.49 
 4. Memory                .39   .32   .51          5.14   1.4 
(SAT) 
 5. Word Meaning          .38   .46   .31   .19    4.74   2.9 
 6. Paragraph Mean.       .37   .45   .29   .20    4.09   2.29 
 7. Vocabulary            .42   .41   .49   .42    5.69   2.0 
 8. Spelling              .33   .42   .29   .14    5.44   2.0 
 9. Word Study Skills     .40   .45   .97   .22    6.56   2.1 
10. Arithmetic            .42   .60   .38   .31    5.57   1.9 

TABLE 2 
Summary of Canonical Correlation Analysis, Grade 1 

Canonical   Canonical                 Chi 
Variate     Correlation   Lambda     Square    df     p 
First          .674        0.466     208.70    24   <.001 
Second         .333        0.855      42.75    15   <.001 
Third          .174        0.962      10.51     8    N.S. 
Fourth         .087        0.992       2.07     3    N.S. 


variate yielded a corresponding canonical correlation of .674; 
the second, .333. An interpretation of these pairs of linear com- 
binations follows. The first canonical variate, accounting for 
45% of the commonly-shared variance obviously reflected a much 
stronger relationship between the two sets of variables than did 
the second pattern of paired combinations which explained 11% 
of the variance. 
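
The entries of Table 2 can be approximately reproduced from the four canonical correlations alone. The sketch below assumes that the significance tests follow Bartlett's chi-square approximation to Wilks' lambda with the multiplier N - 1 - (p + q + 1)/2; the source cites Tatsuoka (1971) for the test but does not print the formula, so the multiplier and the function name `wilks_sequence` are assumptions. With the rounded correlations from the table, the results agree with the printed lambdas to about .001 and with the chi-squares to within roughly one unit.

```python
import math


def wilks_sequence(canonical_rs, n_cases, p, q):
    """Sequential Wilks' lambda tests for a set of canonical correlations.

    canonical_rs: canonical correlations in descending order.
    n_cases:      sample size.
    p, q:         number of variables in the two batteries.
    Returns one (lambda, chi_square, df) tuple per test; test k covers
    canonical correlations k, k+1, ... taken together.
    """
    multiplier = n_cases - 1 - (p + q + 1) / 2.0  # Bartlett's correction (assumed)
    rows = []
    for k in range(len(canonical_rs)):
        lam = math.prod(1.0 - r * r for r in canonical_rs[k:])
        chi_square = -multiplier * math.log(lam)
        df = (p - k) * (q - k)
        rows.append((lam, chi_square, df))
    return rows


if __name__ == "__main__":
    # Grade 1: four CTMM subtests against six SAT subtests, N = 279.
    grade_1 = [0.674, 0.333, 0.174, 0.087]
    for r, (lam, chi_square, df) in zip(grade_1, wilks_sequence(grade_1, 279, 4, 6)):
        # r squared is the share of variance common to that pair of variates.
        print(f"r = {r:.3f}  r^2 = {r * r:.3f}  lambda = {lam:.3f}  "
              f"chi-square = {chi_square:.2f}  df = {df}")
```

The squared correlations printed by the demonstration, about .45 and .11, are the 45% and 11% figures quoted in the text.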

The canonical coefficients for each of these two canonical 
variates were multiplied by their respective standard deviations 
in order to yield relative standardized weights for the variables 
in each of the two sets. The pattern of positive and negative 
standardized weights for CTMM vs. SAT linear combinations at 
the level of grade 1 was: 


First Canonical Variate 
(R = .674) 

CTMM, Level I                    SAT, Primary I 
Logical Reasoning      0.39      Word Meaning          0.27 
Numerical Reasoning    1.21*     Paragraph Meaning     0.40 
Verbal Concepts        0.53*     Vocabulary            1.12* 
Memory                 0.34      Spelling             -0.13 
                                 Word Study Skills     0.08 
                                 Arithmetic            1.59 

* High positive weight. 


It is not surprising that high positive weights on both batteries 
were found to be assumed by tests that are judged to measure 
verbal and quantitative constructs. In the case of these two 
ability factors, it becomes obvious that both the CTMM and 
the SAT measured these abilities to a rather high degree, and 
that the verbal and numerical factors were positively correlated. 


Second Canonical Variate 
(R = .333) 

CTMM, Level I                    SAT, Primary I 
Logical Reasoning      0.12      Word Meaning         -0.18 
Numerical Reasoning   -0.88**    Paragraph Meaning    -0.20 
Verbal Concepts        0.61*     Vocabulary            1.66* 
Memory                 0.56*     Spelling             -0.59** 
                                 Word Study Skills     0.26 
                                 Arithmetic           -0.96** 

* High positive weights. 
** High negative weights. 


The second canonical variate which assumed a substantially 
weaker relationship showed a tendency of the combination of 
high verbal concepts and memory scores on the CTMM with low 
numerical reasoning scores to be related to a pattern comprised 
of high vocabulary scores combined with low spelling scores on 
the SAT. 

Only two significant canonical variates were produced by the 
analysis. Also, the LR subtest of the CTMM as well as the WM, 
PM, and the WSS tests of the SAT failed to attain any substantial 
relative weighting on the two pairs of linear combinations. 
Hence, although a considerable overlap between the two linear 
functions was present, it is reasonable to conclude that the 
attributes measured by these tests were unique to their 
respective batteries. 

Grade 2. The intercorrelation matrix and distribution statistics
for the CTMM and the SAT are presented in Table 3. A somewhat
lower performance on these tests was noted as compared to
the grade 1 results.

A lowering of the intra-CTMM correlations was found (r's
ranging from .19 to .42, with median = .30). This finding was
accompanied by a sizeable reduction in the interbattery
correlations (r's varying from -.03 to .53, with median = .27).
The intra-SAT correlations, which remained quite high, thereby
attested to the relative homogeneity of the SAT.

The canonical correlational analysis results for the CTMM, 
Level I vs. SAT, Primary II are presented in Table 4. Again 
two statistically significant canonical variates were revealed. Their 
corresponding canonical correlations were quite similar to those 
obtained by the grade 1 sample. The patterns of these two canon- 
ical variates were as follows: 


                   First Canonical Variate
                        (Rc = .609)

CTMM, Level I                     SAT, Primary II

Logical Reasoning       0.18      Word Meaning               0.03
Numerical Reasoning     1.37*     Paragraph Meaning          0.35
Verbal Concepts         0.58*     Science & Social Studies   0.63*
Memory                  0.19      Spelling                   0.13
                                  Word Study Skills          0.96*
                                  Language                  -0.71**
                                  Arithmetic Comprehension   0.49
                                  Arith. Concepts            1.30*

* High positive weights.
** High negative weights.

TABLE 4

Summary of Canonical Analysis, Grade 2

Canonical    Canonical                   Chi
Variate      Correlation    Lambda       Square     df     p
First           .609         0.526       175.25     32    <.001
Second          .336         0.836        49.05     21    <.001
Third           .224         0.942        16.42     12    N.S.
Fourth          .093         0.991         2.37      5    N.S.

The SAT, Primary II is expanded as compared to the Primary
I battery. A language test replaced the vocabulary test of Primary
I; two arithmetic tests, Arithmetic Computation and Arithmetic
Concepts, took the place of the single Arithmetic test; and a Science
and Social Studies Concepts test was added.

The first canonical variate yielded a canonical correlation of
.609, which accounted for 36% of the common variance. As with
the Primary I battery at grade 1, the CTMM assumed high
positive weights on its Numerical Reasoning and Verbal Concepts
tests. This combination of positive weights was associated with
a pattern of high positive weights on the SAT comprised of the
Science and Social Studies Concepts test, the Word Study Skills
test, and the Arithmetic Concepts test. A high negative weight
was assumed by the Language test of the SAT. Evidently, whatever
attributes contribute to high performance in language
ability, as measured by the SAT, were unrelated to the
combination of verbal and quantitative abilities.
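
The Lambda and df columns of Table 4 can be checked from the
printed canonical correlations alone. The sketch below assumes
that "Lambda" is Wilks' lambda and that the chi-square column is
Bartlett's approximation; neither is stated in the text. The
function names are illustrative, and the sample size n is left as
a parameter because it is not given in this section.

```python
import math

def wilks_lambdas(canonical_correlations):
    """Wilks' lambda for the test that roots k, k+1, ... are all zero."""
    lambdas = []
    for k in range(len(canonical_correlations)):
        lam = 1.0
        for r in canonical_correlations[k:]:
            lam *= 1.0 - r * r
        lambdas.append(lam)
    return lambdas

def degrees_of_freedom(p, q):
    """df of each successive test for p and q variables in the two sets."""
    return [(p - k) * (q - k) for k in range(min(p, q))]

def bartlett_chi_square(lam, n, p, q):
    """Bartlett's approximation (assumed); n is a hypothetical sample size."""
    return -(n - 1 - (p + q + 1) / 2.0) * math.log(lam)

# Canonical correlations as printed in Table 4 (grade 2).
grade2_r = [0.609, 0.336, 0.224, 0.093]
lambdas = wilks_lambdas(grade2_r)
dfs = degrees_of_freedom(4, 8)  # 4 CTMM tests, 8 SAT Primary II tests

for lam, df in zip(lambdas, dfs):
    print(round(lam, 3), df)
```

Because the correlations are rounded to three places, the
recomputed lambdas agree with the table only to about the third
decimal.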

The second canonical variate, the corresponding canonical cor- 
relation of which was found to be .336, accounted for 12% of the 
common variance and was revealed by the following pattern of 
the pairs of linear combinations: 


                   Second Canonical Variate
                        (Rc = .336)

CTMM, Level I                     SAT, Primary II

Logical Reasoning      -0.21      Word Meaning              -0.83**
Numerical Reasoning    -0.47      Paragraph Meaning          0.13
Verbal Concepts         0.10      Science & Social Studies   1.26
Memory                  1.22      Spelling                  -0.79
                                  Word Study Skills         -0.17
                                  Language                   0.68*
                                  Arith. Comprehension      -0.28
                                  Arith. Concepts            0.03

* High positive weights.
** High negative weights.

The weights of this canonical variate showed a tendency of
the Memory factor as measured by the CTMM to be related to
an SAT combination of Science and Social Studies Concepts
and Language (positive) and Word Meaning and Spelling
(negative).

Components of redundancy. In the preceding discussion,
reference has been made to the proportion of common variance.
This reference was to the amount of variance shared by the
two canonical variates, i.e., pairs of linear combinations. Of more
direct relevance was the proportion of variance shared by the
two test batteries themselves. Such information was forthcoming
through redundancy analysis (Stewart and Love, 1968) in the
solution of the canonical correlation problem. The index of
redundancy is the proportion of variance extracted from the first
battery by a canonical variate given the availability of the second
battery and vice versa. A summary of these results is presented
in Table 5.
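
As a rough illustration of the index just described, the sketch
below computes a redundancy value as the proportion of a battery's
variance extracted by its own canonical variate multiplied by the
squared canonical correlation, which is the usual reading of the
Stewart and Love index. The extracted proportion of .25 is a
hypothetical input, not a figure from the study. The second
function converts a column of redundancies into proportions of
total redundancy, using the grade 1 CTMM column of Table 5.

```python
def redundancy(variance_extracted, canonical_r):
    """Redundancy = variance extracted by the variate * squared canonical r."""
    return variance_extracted * canonical_r ** 2

def proportions_of_total(redundancies):
    """Share of the total redundancy carried by each canonical variate."""
    total = sum(redundancies)
    return [value / total for value in redundancies]

# Hypothetical extracted-variance proportion (.25) with the grade 1
# first canonical correlation (.674) reported in the text.
example = redundancy(0.25, 0.674)

# Grade 1 CTMM redundancies as printed in Table 5.
ctmm_grade1 = [0.115, 0.020, 0.007, 0.002]
shares = proportions_of_total(ctmm_grade1)

print(round(example, 3))
print([round(share, 3) for share in shares])
```

Since the tabled redundancies are rounded to three places, the
recomputed proportions for the other columns of Table 5 match the
printed values only approximately.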

It was noted that the redundant variance represented considerable
shrinkage of the already rather low common variance
extracted by the canonical variates of the two test batteries. The
first canonical variate for the CTMM at both grade levels
revealed a shared variance of .45 and .37, respectively. However,
the redundant variance reduced to .11. This outcome meant that
only 11% of the CTMM variance was explained by the first
canonical variate of the SAT. Likewise, less than 10% (.08 and
.05, respectively) of the SAT variance was explained by the first
canonical variate of the CTMM. For the successive canonical
variates, the redundant variances were further drastically
reduced, since between 70 and 80% of the redundant variance was
associated with the first canonical variate. Hence, it was concluded
that the amount of overlap between the CTMM and the SAT
was quite negligible beyond the first canonical variate, in spite
of the fact that the two significant canonical correlations at both
grade levels revealed similar patterns of linear composites.


TABLE 5
Components of Redundancy, CTMM vs. SAT, Grades 1 and 2

                      Grade 1                      Grade 2
                           Proportion                   Proportion
Canonical                  of Total                     of Total
Variate      Redundancy    Redundancy     Redundancy    Redundancy

CTMM
First           .115          .799           .113          .780
Second          .020          .139           .022          .153
Third           .007          .049           .008          .056
Fourth          .002          .014           .001          .007

SAT
First           .076          .760           .046          .687
Second          .019          .190           .014          .209
Third           .005          .050           .006          .090
Fourth          .001          .010           .001          .015


CONCURRENT VALIDITY OF THE PEABODY PICTURE
VOCABULARY TEST, DRAW-A-MAN, AND CHILDREN'S
EMBEDDED FIGURES TEST WITH FOUR YEAR
OLD CHILDREN


MARY ELLEN DURRETT 
University of Texas, Austin 
JAMES HENMAN 


Santa Clara County Economic Opportunity Corporation 
San Jose, California 


A large number of prekindergarten programs (i.e., Head Start)
have emerged throughout the nation to alleviate the imbalance in
terms of intelligence scores, school readiness, and achievement of
children between the white middle-class and the low income racial
minorities. These prekindergarten programs require systematic,
reliable, and valid evaluations. In addition to these requirements,
the evaluations must also be able to be administered in a short
period of time and be interesting to the children.

Intelligence test scores have shown a demonstrable success in
predicting school achievement (Anastasi, 1969; Lesser, Fifer and
Clark, 1965). As Kohlberg and Zigler (1967) have noted, despite
the criticisms leveled against the IQ measure, a child's IQ score
obtained in a standard situation has more behavioral correlates
than any other psychological measure. A wealth of reliability and
validity information exists on the Stanford Binet Intelligence Test,
Form LM, but it fails to meet the two additional criteria of short
administration time and interest for many young children from the
lower socioeconomic level (i.e., Mexican-American). It is for this
reason that researchers have used alternative tests, i.e., the Peabody
Picture Vocabulary Test (PPVT), Draw-A-Man (DAM), and
Children's Embedded Figures Test (CEFT) (Hodges and Spicker,
1967; Howard and Plant, 1967). One important question raised by
the use of these tests is their reliability and validity with preschool
children. The present paper deals primarily with the question of
validity. The purpose of this study was to report on the concurrent
validity coefficients between each of the tests (PPVT, DAM, CEFT)
and a criterion measure given by individual intelligence test IQs
as measured by the Stanford Binet Intelligence Test, Form LM
(SBLM).

In order to provide additional information regarding the validity
of the PPVT, both the raw scores and the derived IQ scores were
utilized. Previous studies (DiLorenzo, 1968; Gray and Klaus,
1965) have pointed out that the use of PPVT IQ scores is not
advisable.

Method—Subjects. The subjects were 98 preschool children be- 
tween the ages of three years and four years nine months, with a 
mean age of three years eight months at the beginning of the school 
year from low income, Mexican-American families and middle in- 
come Anglo families. The total sample was subdivided into six sub- 
samples as follows: three experimental groups enrolled in prekin- 
dergarten programs at San Jose State College Child Laboratory 
and three control groups. The experimental groups were: (1) sev- 
enteen middle-class Anglo children, nine boys and eight girls; (2) 
sixteen Mexican-American children from low income families, 
eight boys and eight girls; (3) seventeen children, nine middle- 
class Anglo children and eight low income Mexican-American chil- 
dren. None of the children in the control groups was attending 
school. 

Procedure. Each child was given a battery of tests, including the
SBLM, PPVT, DAM, and CEFT. The first three tests were
administered at the beginning of the school year, and eight months
later, the children were retested with the same tests and also
administered the CEFT. At the time of the pretest it was
demonstrated that a meaningful score on the CEFT could not be obtained
from these children. Each test was given individually, and the
procedure stated in each test manual was followed carefully.

Results and discussion. The pretest and posttest mean scores
and standard deviations for the SBLM, PPVT (raw scores), PPVT
(derived IQ scores), DAM (standard scores), and CEFT (raw
scores) are available on request.

The product-moment correlations (validity coefficients) of the
pretest and of the posttests relative to the SBLM criterion measure
are presented in Tables 1 and 2, respectively. In general, the
correlations on the pretest measures were higher than those on the
posttest measures, and the correlations for the middle-class Anglo
children tended to be higher than those for the Mexican-American
children from low income families. It should be acknowledged that
the narrow age range of the subjects (21 months) would provide
relatively conservative estimates of the correlations as compared
with those in some studies. However, this type of research,
evaluating preschool programs, has inherent in it the necessity of
working with limited age ranges. The correlations in the combined group
between the criterion variable SBLM and each of the PPVT,
DAM, and CEFT measures were all statistically significant beyond the
.01 level at the time of both pretesting and posttesting.

The intercorrelations of the pretest and posttest scores for the
combined group (total sample) are presented in Table 3.
The intercorrelations between the SBLM, PPVT, DAM, and
CEFT were all statistically significant beyond the .05 level.

Regarding the use of the PPVT derived IQ scores the results of 
this study tended to support the finding reported by DiLorenzo 


TABLE 1
Pretest Correlations between the Criterion Measure Stanford Binet Form LM
(SBLM) and Scores from the Following Tests: PPVT (Raw and
Derived Scores), and DAM (Standard Scores)

                                                       DAM
Group             N    PPVT (Raw)    PPVT (Der)     (Standard)
Homogeneous
  I   EMCA       17      .50*          .52*         [illegible]
  II  ELIM       16      .72**       [illegible]    [illegible]
  IV  CMCA       16    [illegible]     .73          [illegible]
  V   CLIM       16      .30           .27          [illegible]
Heterogeneous
  III EH         17      .76**         .68**        [illegible]
  VI  CH         16      .76**         .63          [illegible]
Combined
  Groups         98      .72**         .58**          .41**

Note. (I) EMCA: Experimental Middle-Class Anglo; (II) ELIM: Experimental Low
Income Mexican; (III) EH: Experimental Heterogeneous; (IV) CMCA: Control Middle-Class
Anglo; (V) CLIM: Control Low Income Mexican; (VI) CH: Control Heterogeneous.

*p < .05.
**p < .01.


TABLE 2

Posttest Correlations between the Criterion Measure Stanford Binet Form
LM (SBLM) and Scores from the Following Tests:
PPVT (Raw and Derived), DAM (Standard), CEFT (Raw)

                       PPVT      PPVT
Group             N    (Raw)     (Der)      DAM          CEFT
Homogeneous
  I   EMCA       17     .10       .30       .41          .37
  II  ELIM       16     .19       .35       .23        [illegible]*
  IV  CMCA       16     .45       .54*    [illegible]    .50*
  V   CLIM       16     .31       .26       .42          .03
Heterogeneous
  III EH         17     .46       .56*      .33          .61**
  VI  CH         16     .53*      .03**     .56*         .06
Combined Groups  98     .52**     .57**     .58**        .48**

Note. (I) EMCA: Experimental Middle-Class Anglo; (II) ELIM: Experimental Low
Income Mexican; (III) EH: Experimental Heterogeneous; (IV) CMCA: Control Middle-Class
Anglo; (V) CLIM: Control Low Income Mexican; (VI) CH: Control Heterogeneous.

*p < .05.
**p < .01.


(1968). Although the PPVT could be used in assessment of young 
children, it should not be relied upon to give a valid measure of 
the child's IQ. 

Summary. Ninety-eight preschool children (mean age, three
years eight months) were tested to determine concurrent validity
for the PPVT, DAM, and CEFT. The criterion variable for the
estimation of the validity was the Stanford Binet Intelligence Test,
Form LM. The findings revealed many significant validity
coefficients.


TABLE 3

Pretest and Posttest Intercorrelations for the Following Tests:
SBLM, PPVT, DAM, CEFT***

                            PPVT      PPVT
Test (N = 98)     SBLM      (Raw)     (Der)     DAM       CEFT

SBLM               -        .52**     .57**     .53**     .43**
PPVT (Raw)        .72**      -        .82**     .29**     .40**
PPVT (Der)        .58**     .70**      -        .28**     .30**
DAM               .41**     .22*      .30**      -      [illegible]

*p < .05.
**p < .01.
*** Pretest intercorrelations are shown in the lower left, and
posttest intercorrelations are in the upper right of table.


CANONICAL VALIDITY OF PERCEPTUAL-MOTOR
SKILLS FOR PREDICTING AN ACADEMIC
CRITERION


BRAD S. CHISSOM, JERRY B. THOMAS, AND JUDSON BIASIOTTO
Georgia Southern College


The relationship between the perceptual-motor and intellectual
domains has been the subject of numerous investigations.
Typically, these investigations have used univariate or multiple
correlations as the technique for establishing the relationship.
Because the variables within each domain are complex in nature, a
more useful approach to establishing such relationships would be
canonical analysis. Canonical analysis allows for maximizing the
relationship between sets of variables by weighting each variable
within the set according to its contribution to the overall
relationship.

Previous research (Thomas and Chissom, 1972) conducted in 
this area has indicated that a more significant relationship be- 
tween the perceptual-motor and intellectual domains occurs at 
the kindergarten level and diminishes as children mature. Through 
canonical analysis, this study investigated the relationship be- 
tween perceptual-motor abilities as defined by the Shape-O Ball 
Test (Thomas and Chissom, 1972) and the Frostig Developmental 
Test of Visual Perception (Maslow, Frostig, Le Fever, and Whit- 
tlesey, 1964) and intellectual abilities as defined by a complex 
teacher rating scale used in previous research studies by Chissom 
and Thomas (1971b) and by Thomas and Chissom (1972). 

Method—Subjects. Subjects for this study were 38 kindergarten 
children (23 boys and 15 girls) from the Marvin Pittman Laboratory
School at Georgia Southern College. The makeup of the students
in this school was designed to reflect the community population.
The mean age of the participants was 68.3 months at the time
of test administration.

Instruments. The Shape-O Ball, (which is a commercial product 
of Dart Industries, Inc.) consists of a hollow plastic sphere 6” in 
diameter with different geometrically shaped holes in the surface 
of the sphere. Plastic geometric pieces matching the holes are in- 
serted into the sphere by the examinee as rapidly as possible. The 
subjects completed four timed trials of the test. The performance 
score was the sum of the four trials. A split-half reliability estimate 
of .96 was obtained. 

The Marianne Frostig Developmental Test of Visual Perception
consists of five subtests (Eye Motor Coordination, Figure
Ground, Form Constancy, Position in Space, and Spatial
Relations) designed to measure separate visual-motor areas. A previous
study conducted by two of the authors (Chissom and Thomas,
1971a) indicated that the Frostig DTVP adequately defines a general
factor and possibly additional specific visual-motor factors.

The academic criterion measure consisted of a complex teacher
rating (TR) in which the teacher rated her students from a high
of nine to a low of one in four separate areas (reading readiness,
quantitative, verbal, and listening). A reliability estimate for the
sum of the four separate TRs, calculated by Cronbach's Alpha, was
.96.
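
For readers unfamiliar with the coefficient, the sketch below
shows one way Cronbach's alpha could be computed for a four-part
rating of this kind. The ratings are invented for illustration
(five hypothetical students on the 1 to 9 scale) and are not the
study's data; the function name is likewise an assumption.

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of k lists, one rating per student."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(pvariance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / pvariance(totals))

# Hypothetical teacher ratings (1 = low, 9 = high) for five students.
reading = [7, 5, 9, 3, 6]
quantitative = [6, 5, 8, 3, 6]
verbal = [7, 4, 9, 2, 6]
listening = [8, 5, 9, 3, 5]

alpha = cronbach_alpha([reading, quantitative, verbal, listening])
print(round(alpha, 2))
```

Alpha approaches 1 when the separate ratings rank the students in
nearly the same order, which is consistent with the high value
the authors report for the summed rating.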

Results and discussion. Two canonical analyses were conducted.
Results of the first canonical analysis between the perceptual-motor
domain (Shape-O Ball Test and Frostig DTVP) and the four-part
intellectual criterion (TR) are shown in Table 1. A significant
(p < .01) canonical correlation (Rc) of .70 was obtained. From
observing the Z-score (beta) weights, the Shape-O Ball Test
(negative weight caused by scoring system) made the most
significant contribution from the perceptual-motor domain, while
reading readiness and verbal ability offered the greatest
contribution from the intellectual domain.

The second canonical analysis used the five Frostig DTVP
subtests as predictors of the four-part intellectual criterion.
Results of this analysis are reported in Table 2. A significant
(p < .02) Rc of .78 was obtained. By observing the Z-score weights, it
can be seen that Subtests IV (Position in Space) and V (Spatial
Relations) were the major contributors from the Frostig DTVP
battery while the verbal ability rating was the major intellectual
contributor.


TABLE 1

Means, Standard Deviations, Z-Score (Beta) Weights, and Canonical
Correlation for the Shape-O Ball, Frostig DTVP Total Score,
and Teacher Ratings

                                                   Z-Score
Variables                    M          S.D.     (Beta) Weights
Perceptual-Motor
  Shape-O Ball             429.79      189.54        -.927
  Frostig DTVP Total        44.4     [illegible]   [illegible]
Teacher Ratings
  Reading Readiness          5.84        1.42      [illegible]
  Quantitative               5.58        1.31      [illegible]
  Verbal                     5.97        1.60         .603
  Listening                  5.84        1.41         .391

Rc = .70, χ² = 25.21, df = 8, p < .01


TABLE 2

Means, Standard Deviations, Z-Score (Beta) Weights and Canonical Correlation
for the Frostig DTVP Subtests and Teacher Ratings

                                                   Z-Score
Variables                    M          S.D.     (Beta) Weights
Frostig DTVP
  I   Eye Motor             13.63        4.10         .003
  II  Figure Ground         13.89        4.41        -.037
  III Form Constancy         8.74        3.07         .136
  IV  Position in Space      4.45        1.72         .264
  V   Spatial Relations      3.76        1.76         .426
Teacher Rating
  Reading Readiness          5.84        1.42         .264
  Quantitative               5.58        1.31        -.358
  Verbal                     5.97        1.60         .874
  Listening                  5.84        1.41        -.196

Rc = .78, χ² = 36.25, df = 20, p < .02


CORRELATIONS AMONG CERTAIN VARIABLES AND
THE QUALITY OF THE WRITTEN EXPRESSION OF
THIRD GRADE CHILDREN UNDER STRUCTURED
AND NONSTRUCTURED TEACHING SITUATIONS


MARY JO WOODFIN 
California State College, Long Beach 


There is little research concerned with the prediction of which
types of children will write better under different methods of
structuring the writing situation. IQ level as a predictor of writing
performance in structured and nonstructured writing experiences
was found to be significant by Servey (1959), who reported that
children of average intelligence in grades three through six wrote
better compositions in a structured situation, while children of
high intelligence wrote better in an unstructured atmosphere.
Woodfin (1966) reported that, although high IQ children generally
wrote significantly better stories than low IQ children under
similar writing conditions, there were no significant differences in
writing ability between high and low IQ groups when low IQ children
wrote under longer time conditions, and, in some instances, low
IQ children writing for extended periods of time wrote
significantly better than high IQ children who wrote for short time
periods. Sex as a predictor of quality of children's written expression
under differing writing situations was found by Woodfin (1966) to
be related to writing time assigned: girls did tend to write
significantly better stories than did boys under short-time conditions;
there tended to be no significant sex differences under long time
limits; and boys who wrote under the longer time limits usually
rated higher than girls who wrote under short time limits.

Purpose. The purpose of the investigation was to determine, for
a structured and nonstructured teaching method to be discussed,
the comparative validity of selected achievement and aptitude
variables as well as of certain demographic characteristics with each
of several criterion measures reflecting writing performance.
Specifically, validity coefficients were sought for predictors
represented by the language subtests of the Iowa Tests of Basic
Skills (ITBS) (Lindquist and Hieronymus, 1956), the Gates Advanced
Primary Reading Test total score (Gates, 1961), IQ furnished by
total score on the Lorge-Thorndike Intelligence Tests, Primary
Battery, Level Two (Lorge and Thorndike, 1957), an index of
socioeconomic status as measured by Hollingshead's Two Factor Index
of Social Position (1957), and sex with each of the following
criterion measures in composition: (a) effectiveness of expression
of ideas (McClellan, 1956), (b) effectiveness of organization of
ideas (Woodfin, 1966), and (c) number of words per composition.
These variables are cited in Table 1.

Method. Five samples of the written expression of 531 third
grade children in a large metropolitan community in Southern
California were obtained by the investigator every two weeks
using nonstructured teaching methods. Children were free to
indicate whether they wanted to write and to choose topic, format, and
style of writing. They were instructed to write until they had
finished something that pleased them. Help with spelling was given
individually as needed.

The structured samples of writing from these same children were
gathered by regular classroom teachers using their usual methods
of teaching creative writing. The first structured sample was
written two weeks before the five trial sessions conducted by the
investigator, and the second sample six weeks after the trials. In
structured assignments, topics were assigned.

Using Hotelling's (1940) procedure for comparing the
significance of the difference between two correlations, the data were
studied for significant differences in writing performance under the two
ways of motivating creative writing.
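
A minimal sketch of such a comparison is given below, assuming
the procedure is the commonly cited Hotelling t statistic for two
correlations that are computed on the same subjects and share one
variable (here, a predictor correlated with the same criterion
under each writing condition). The article does not print the
formula, so this form and all numerical inputs are assumptions
made for illustration only.

```python
import math

def hotelling_t(r_xy, r_zy, r_xz, n):
    """t test of r_xy versus r_zy when x, y, z come from the same n subjects.

    y is the shared variable; r_xz is the correlation between the
    two non-shared variables. Returns (t, degrees of freedom).
    """
    determinant = (1 - r_xy ** 2 - r_zy ** 2 - r_xz ** 2
                   + 2 * r_xy * r_zy * r_xz)
    t = (r_xy - r_zy) * math.sqrt((n - 3) * (1 + r_xz) / (2 * determinant))
    return t, n - 3

# Hypothetical values: a predictor correlates .50 with words written
# under one condition and .30 under the other; the two word counts
# correlate .40; 103 children.
t_value, df = hotelling_t(0.50, 0.30, 0.40, 103)
print(round(t_value, 2), df)
```

The statistic is referred to a t distribution with n - 3 degrees
of freedom; identical correlations give t = 0.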

Findings. Correlations between various characteristics of children
and writing evaluation measures are given in Table 1. In only two
instances were there significant differences in predicting writing
performance under structured and nonstructured writing
conditions:

1. Initial spelling ability correlated significantly more with
number of words written under structured writing procedures
than under nonstructured teaching methods.

2. Initial ability on the Gates reading average correlated
significantly higher with number of words per composition
written under structured than under nonstructured teaching
methods.

Conclusions. For the most part, initial standing in language,
reading, intelligence, socioeconomic status, and sex differences did
not predict writing ability of third grade children any more
accurately for structured than for nonstructured teaching methods,
although children who did not spell or read as well as others were
able to produce more volume in their writing when working under
a less rigid atmosphere. This outcome may well carry over into
other areas of writing with more successful experiences.

While the method used in this investigation is obviously too
complicated and time consuming for the average teacher to use in
determining which children achieve better in differing academic
areas under his tutelage, development of simple processes to
analyze effectiveness of the learning of differing types of children
under different teaching procedures seems a necessity.


MYERS ACHIEVEMENT MOTIVATION SCALE: A
VALIDATION STUDY


L. STEWIN AND V. NYBERG
University of Alberta

ACHIEVEMENT motivation as a prime determinant of man's in- 
dividual potential has been the subject of many psychological 
queries, both theoretical and psychometric, within the past two 
decades. Subsequent to the classical projective measurement tech- 
nique devised by McClelland (McClelland, Atkinson, Clark and 
Lowell, 1953) various projective and “objective” measures of the 
construct have been devised (Myers, 1965). Recently, the validity 
of both objective and projective measures, including McClelland’s 
use of the Thematic Apperception Test (TAT), has been ques- 
tioned (Hermans, 1970). 

Myers (1965) has devised an achievement motivation scale
which he maintains is highly satisfactory. Its scores have yielded
correlations of approximately 0.5 with grade point averages
(GPA's) as indices of actual achievement. Other advantages of
this instrument include simplicity of administration and scoring
as well as its favorable comparison with results arising from use of
standard projective techniques (Myers, 1965).

Purpose. The purpose of this investigation was to obtain
evidence regarding the relationship of the Myers Achievement
Motivation Scale to measures of achievement and ability of junior and
senior high school students in an economically depressed area of
rural Alberta. A comparison of the indices of relationship with
those of an earlier study was also sought.

Procedure. The Myers Achievement Motivation Scale was
administered to junior high school (grades 7, 8, 9; N = 297) and
senior high school (grades 10, 11; N = 155) pupils in the fall of
1970. Administration and statistical procedures employed were
similar to those reported by Myers (1965).

Myers’ (1965) sample consisted of “college bound” high school
juniors from eastern American high schools. No information was
given regarding the calculation of GPA’s; however, the Preliminary
Scholastic Aptitude Tests (PSAT), verbal and mathematics
sections, were employed as measures of intellectual ability.

In the present study, GPA’s were calculated for junior high stu- 
dents from final grades of the previous year; GPA’s for grade 10 
and 11 students were obtained from grade nine achievement scores. 
In Alberta a battery of external examinations, prepared by the 
Department of Education, is administered to school pupils at the 
end of the ninth grade. The battery consists of six achievement 
tests and the School and College Ability Test (SCAT). Norms, 
based on approximately 30,000 Alberta pupils, were available. 
SCAT verbal and quantitative scores were employed as measures 
of intellectual ability for grade 10 and 11 students; ratings from 
group intelligence tests administered within the school system 
were obtained for junior high students in the sample. Intercorrela- 
tions involving GPA’s, SCAT scores (grades 10, 11), IQ scores 
(grades 7, 8, 9), and Myers’ scale scores were calculated. 

Findings. Results for grades 7, 8, and 9 are presented in Table 1,
outcomes for grades 10 and 11 in Table 2, and the data of Myers’
study (1965, p. 358) are reproduced in Table 3.

Achievement motivation correlated significantly with grade
point average (.29) at the grade nine level only. Correlations with
ability were not significantly different from zero in any group
studied. As indicated in Table 3, Myers obtained, using high school
juniors, low but significant correlations between achievement
motivation and (1) PSAT scores, and (2) GPA's. Correlations
obtained in the current study (Tables 1 and
2) strongly suggest that the Myers Achievement Motivation Scale
was not a valid measure of achievement motivation for the group
being investigated.


TABLE 1
Intercorrelations of Myers Achievement Motivation Scale, IQ, and GPA

                 Grade 7           Grade 8           Grade 9
                 (N = 95)          (N = 83)          (N = 119)

                 IQ      GPA       IQ      GPA       IQ      GPA
Motivation      -.08    -.13       .15     .13      -.03     .29*
IQ                       .57*              .34*              .47*

* Significant at .05 level.


TABLE 2

Intercorrelations of Myers Achievement Motivation Scale,
SCAT V, SCAT Q, and GPA
(Grades 10 and 11 combined)

                 SCAT V    SCAT Q    GPA
Motivation        .07       .11      .03
SCAT V                      .55*     .80*
SCAT Q                               .78*

N = 155 (Grade 10 N = 84; Grade 11 N = 71).

* Significant at .05 level.
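
The significance statements for Table 1 can be checked with the
ordinary t test for a product-moment correlation. The article
does not say which test the authors used, so the sketch below is
an assumed reconstruction; it uses only the correlations and
sample sizes printed in Table 1, and the function name is
illustrative.

```python
import math

def t_for_r(r, n):
    """t statistic for testing a correlation against zero, df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Grade 9: motivation with GPA, r = .29, N = 119.
t_grade9_gpa = t_for_r(0.29, 119)

# Grade 8: motivation with IQ, r = .15, N = 83.
t_grade8_iq = t_for_r(0.15, 83)

print(round(t_grade9_gpa, 2), round(t_grade8_iq, 2))
```

With roughly 80 to 120 degrees of freedom the two-tailed .05
critical value of t is close to 1.98 or 1.99, so the grade 9
value exceeds it and the grade 8 value does not, in line with the
asterisks in Table 1.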

Discussion. Several differences existed between the sample
described in Myers’ study and that of the present study, particularly
concerning educational aspirations and environmental factors. The
Canadian (Alberta) sample did encompass a wider age range and
a greater ability range than did Myers’ sample, perhaps as a
function of the “economically depressed" background of its
members. The original sample was described as “college bound" high
school juniors of an Eastern American background. However,
contrary to the study findings, greater variance should ordinarily
result in increased, rather than reduced, correlations (Levin, 1972).
Minor differences, previously noted, in the procedures employed
to obtain GPA and IQ would not be expected to affect the
magnitude of correlations obtained to any significant degree. No logical
explanation for the discrepancies observed between the two sets of
results could be postulated except for the hypothesis that some
factor contributing to success in high school and college academic
programs tended to be present in Myers’ sample but absent in the
Canadian sample.


TABLE 3
Intercorrelations of Achievement Motivation Scale,
PSAT and Grade-Point Average

[Table body not recoverable from the source scan. Legible
fragments: 21, 50, (19), (.55), .59, (62).]

* Values for females are given in parentheses.
Note. [illegible] were significant at the .05, .01, and .002 levels,
respectively.


THE RELATIONSHIP OF EACH OF SIX SCALES OF
THE STUDY ATTITUDES AND METHODS SURVEY
(SAMS) TO EACH OF TWO CRITERIA OF ACADEMIC
ACHIEVEMENT IN A COMMUNITY COLLEGE


DORIS CRANE MILLER 
Pasadena City College 
WILLIAM B. MICHAEL 
University of Southern California 


For a total sample of 280 students in an introductory psychol- 
ogy course offered at a large community college in a middle-class 
Los Angeles suburb and for two resulting subsamples of 138 males 
and 142 females, the purpose of the study was to ascertain the pre- 
dictive validity of each of six factor scales of the Study Atti- 
tudes and Methods Survey (SAMS) (W. Michael, J. Michael, and 
Zimmerman, 1972) with each of two criterion measures: (а) оуег- 
all grade point average (GPA) in at least one semester of college 
work and (b) scores on a common final examination administered 
in the introductory psychology classes taught by five different 
teachers. In addition, the degree of relationship was sought between 
each of the six factor scales of the SAMS, which had been administered
both as a pretest in September 1971 and as a posttest in
January 1972, and other variables such as reading skills, prior high
school achievement (GPA), and age that are enumerated in Tables
1 and 2.

Methodology. Product moment coefficients of correlation were 
calculated among all possible pairings of variables for the total 
sample and the two subsamples. Although not reported in detail,
stepwise multiple regression analyses were carried out for the prediction
of each of the two criterion measures (Variables 15 and 16)
from optimally weighted composites of six SAMS variables alone
and from combinations of SAMS variables and other remaining
predictor variables.
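
The correlational step described under Methodology can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are randomly generated placeholders, the function name `correlation_tables` is hypothetical, and only the subsample sizes (138 males, 142 females) are taken from the report.

```python
import numpy as np

def correlation_tables(scores, is_male):
    """Pearson product-moment correlations among all variable pairs,
    computed for the total sample and for each sex separately."""
    scores = np.asarray(scores, dtype=float)
    is_male = np.asarray(is_male, dtype=bool)
    return {
        "total": np.corrcoef(scores, rowvar=False),
        "males": np.corrcoef(scores[is_male], rowvar=False),
        "females": np.corrcoef(scores[~is_male], rowvar=False),
    }

# Placeholder data: rows are students, columns are three variables.
rng = np.random.default_rng(0)
demo_scores = rng.normal(size=(280, 3))
demo_is_male = np.arange(280) < 138  # 138 males, then 142 females

tables = correlation_tables(demo_scores, demo_is_male)
print(np.round(tables["total"], 2))
```

Each returned matrix is symmetric with ones on the diagonal; the off-diagonal entries correspond to the pairwise coefficients that a table of intercorrelations would report.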

Findings. Inspection of the entries in Table 1 and Table 2 reveals
the following general results: (a) Although the majority of
the validity coefficients for the six factor scales of the SAMS were
statistically significant both in the total sample and for the two
subsamples, the magnitudes of the correlations were relatively low,
an outcome that might have resulted in part from a rather substantial
restriction of range in the performance of the surviving
students on the criterion variables. (b) The posttest SAMS measures
were more highly related to each of the two criterion variables
than were the pretest SAMS measures. (c) Although exceptions
were numerous, the validity coefficients tended to be slightly higher
for females than for males. (d) Although data are not reported in
detail, multiple correlation coefficients of optimally weighted composites
of the six pretest factor scales of the SAMS were .31 and


423 to 472, respectively, for criterion variables 15 and 16. 
Conclusions. The six factor scales of the SAMS tended to yield
low but statistically significant validity coefficients relative to the


the six SAMS scales which were designed primarily for counseling
purposes had modest predictive validity in relation to the two
achievement variables studied whether these scales were used
singly or collectively.
THE PREDICTION OF MARIJUANA USE FROM
PERSONALITY SCALES


S. D. KNECHT AND B. P. CUNDICK
Brigham Young University
D. EDWARDS AND E. K. ERIC GUNDERSON
Navy Medical Neuropsychiatric Research Unit, San Diego


Drug abuse, once viewed solely as a chronic problem of poor 
minority groups, ghetto neighborhoods, or of alien cultural groups, 
has grown to epidemic proportions through our society. Extreme 
concern has been expressed at many levels of national life, includ- 
ing The White House, and illicit drug use has become a significant 
area of medical, social, and psychological research. The literature 
on drug abuse has failed thus far to describe clearly the person- 
alities of individuals who engage in the use of illegal drugs. With- 
out implying causal relationships, marijuana use is often seen as 
the first step in drug involvement and perhaps can be regarded as 
part of the initiation rites for intensive drug experience. In any 
case, investigators agree that marijuana use is highly correlated 
with other drug abuse (Kaplan, 1971). 

Implicit in most popular conceptions of the drug user are the 
notions that these persons typically reject the standards and values 
of parents, schools, and traditional authorities and that drug use 
is an expression of rebellion against prevailing social standards. 
While much anecdotal evidence and casual observation seem consistent
with this view, documentation of this generalization is still
meager. In addition, a commonly held view, although at present
entirely speculative, is that drugs are instrumental in reducing
anxiety among youth in the face of unprecedented rates of social,
cultural, and environmental change (Louria, 1968). This proposi- 
tion implies that drug users are unusually sensitive and concerned 
about their environment and their fellow-man, a view which is con- 
trary to much clinical observation extant concerning the drug en- 
thusiast, and that the anxiety reduction achieved outweighs the 
fear of legal and social sanctions. The more romantic notion that 
psychedelic experiences are sought because they enhance insight, 
talent, sensitivity, or personality functioning generally has been 
refuted by available evidence (for example, see Frosch, 1969). It 
has been observed in a number of studies that a high proportion of 
persons using drugs have some degree of personality disorder 
(Nail, Gunderson, and Thompson, 1972). 

In an effort to advance understanding of drug-taking and drug- 
oriented subcultures, several investigators, using field interviews 
and clinical observations, have attempted to characterize drug 
users in the psychological and behavioral terms (Cary, 1968; Gold- 
stein, 1966; and McGlothlen and West, 1968). Other investigators 
have employed attitude and personality scales in an effort to 
achieve better standardized and more reliable descriptions of users 
(Comrey, 1970; Hogan, Mankin, Conway, and Fox, 1970; Phil- 
lips and Delhees, 1968). In the Hogan, et al. study, Gough’s Cali- 
fornia Psychological Inventory (CPI) scales were correlated with 
four levels of marijuana use. Marijuana users were reported to be 
more impulsive and nonconforming, and also more adventuresome, 
empathic, and socially poised than nonusers. These results were 
confounded, however, by differences among marijuana use groups 
on other variables, such as fraternity membership, year in school, 
academic major, and scholastic achievement. The relationships of 
CPI dimensions to marijuana use need further clarification. 

Phillips and Delhees (1968) reported that the Cattell Sixteen 
Personality Factor Questionnaire (16 PF) described similar per- 
sonality profiles for drug users in a prison population. The 16 
PF profile reflected characteristics of both clinical and antisocial 
behavior types. The only major study utilizing the Comrey scales 
was that of Comrey and Backer (1970). The most striking correlation
was between the Social Conformity Scale and self-reported
marijuana use (r = -.54). Significant correlations also were obtained
between marijuana use and Orderliness (r = -.30), Masculinity
(r = .18), and Trust (r = -.14).

Psychometric methods appear to offer one useful approach to 
clear definition and description of drug users as a group. In the 
present study, the Comrey Personality Scales (CPS) and the Cattell
16 PF Questionnaire were utilized in an investigation of the
personality correlates of marijuana use. No studies were previously 
available which provided evidence of stability of results within a 
cross-validational design. 

One of the basic questions in the investigation of drug abuse is 
the extent to which this behavior represents social rebellion and 
overlaps with other types of nonconformist and delinquent behavior.
Etiological understanding and appropriate treatment
strategies obviously are dependent upon clear delineation of underlying
psycho-social dimensions. While no single psycho-social
dimension or simple taxonomy can be expected to encompass the
wide range of behaviors linked to illicit drug use, the conceptual
and practical problems of classification of drug abusers will be 
simplified and enhanced to the extent that existing theory and 
knowledge concerning nonconformity can be brought to bear. 

Method—Subjects. Subjects were 135 undergraduate students at 
Appalachian State University in North Carolina. Subjects were 
participants in a confidential, self-report drug use survey which 
provided the information needed to classify individuals with respect
to extent of marijuana use. Careful and extensive interviewing
was employed to exclude those individuals who predominantly
used some illegal drug other than marijuana.

Subjects were divided into three criterion groups with the fol- 
lowing distributions by sex: (1) Non-use, 41 males and 33 females; 
(2) Occasional use, 20 males and 20 females; and (3) Heavy use, 
20 males and 11 females. The three groups did not differ significantly
on age, class level, scholastic achievement, or academic
major.

Criterion Groups. The three criterion groups defined by the drug 
use survey data can be characterized more specifically as fol- 
lows: 

(1) Non-use—Members of this group denied the use of any illegal
drug during the past four months;

(2) Occasional Use—Persons in this group reported use of four
marijuana “joints” or fewer per week and less-than-monthly use 
of any other illegal drug during the past four months, and 

(3) Heavy Use—Members of this category used five or more 
“joints” per week during the past four months. 

For purposes of analysis the criterion groups were assigned val- 
ues of 1, 2, and 3, respectively. 

Procedure. The CPS and 16 PF Questionnaires were adminis- 
tered to all subjects after they had been selected for and assigned 
to criterion groups. The total sample was randomly divided into 
validation and cross-validation samples. Correlations were com- 
puted between the test scales and the scale representing the three 
levels of marijuana use in the validation sample. Also, regression 
equations were developed in the validation sample using a step- 
wise multiple regression procedure, and these regression equations 
were applied to the cross-validation sample. 
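
The validation and cross-validation logic of the Procedure can be illustrated with a short Python sketch. It rests on several assumptions that are not in the report: the data are simulated, the function name `cross_validate` is hypothetical, and an ordinary least-squares fit on a fixed set of predictors stands in for the stepwise selection the authors used.

```python
import numpy as np

def cross_validate(X, y, seed=0):
    """Fit regression weights on a random half of the sample and
    correlate their predictions with the criterion in both halves."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    half = len(y) // 2
    val, cross = order[:half], order[half:]

    # Least-squares weights (with intercept) from the validation half only.
    A_val = np.column_stack([np.ones(len(val)), X[val]])
    weights, *_ = np.linalg.lstsq(A_val, y[val], rcond=None)

    # Apply the same weights, unchanged, to the held-out half.
    A_cross = np.column_stack([np.ones(len(cross)), X[cross]])
    r_val = np.corrcoef(A_val @ weights, y[val])[0, 1]
    r_cross = np.corrcoef(A_cross @ weights, y[cross])[0, 1]
    return r_val, r_cross

# Simulated stand-ins for two scale scores and a use criterion (135 cases).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(135, 2))
y_demo = -0.8 * X_demo[:, 0] - 0.3 * X_demo[:, 1] + rng.normal(size=135)

r_val, r_cross = cross_validate(X_demo, y_demo)
print(round(r_val, 2), round(r_cross, 2))
```

The first value plays the role of the multiple correlation in the validation sample and the second the role of the cross-validity coefficient; some shrinkage from the first to the second is the usual expectation, because the weights were optimized only on the validation half.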

Mean scores were computed for the three criterion groups using 
total samples, and the resulting profiles provided comparisons of 
the three groups on the CPS personality dimensions. 

Results. Correlations between test scales and the marijuana use 
criterion are shown in Table 1. The primary validities for the Social 
Conformity (r = -.60, p < .001) and Orderliness (r = -.48, p <
.001) Scales of the CPS closely paralleled the findings of Comrey
and Backer (1970). A significant correlation also was present for 
the Activity Scale (r = —.34, p < .01); this relationship had not 
been previously reported. 

Of the 16 PF scales, Superego Strength (r = —.46, p < .001), 
Tenderminded (r = .39, p < .01), and Controlled (r = -.27, p <
.05) correlated significantly with the marijuana use criterion.

When correlations were computed separately for male and fe- 
male portions of the total sample, the patterns of correlations were 
essentially similar for the two groups. The Social Conformity Scale 
was most highly correlated with marijuana use for both male and 
female groups. 

When variables which correlated significantly with the criterion 
were entered into the step-wise regression procedure, only two 
scales, Social Conformity and Activity from the CPS, contributed 
uniquely to the prediction of marijuana use. The multiple correla- 
tion obtained with these two variables in the validation sample 
TABLE 1

Correlations between Personality Scales and Marijuana Use Criterion
in the Validation Sample

Test Scales                          r
Comrey Personality Scales:
  Validity                          .20
  Response bias                    -.09
  Trust                            -.02
  Orderliness                      -.48***
  Social conformity                -.60***
  Activity                         -.34**
  Emotional stability              -.10
  Extroversion                     -.18
  Masculinity                       [?]
  Empathy                           [?]
16 Personality Factor Scales:
  Outgoing                          [?]
  Intelligent                       .13
  Ego strength                      [?]
  Dominance                         .16
  Surgency                         -.00
  Superego strength                -.46***
  Venturesome                      -.05
  Tender-minded                     .39**
  Suspicious                        .18
  Imaginative                       .21
  Shrewd                            [?]
  Apprehensive                      .19
  Radicalism                        .18
  Self-sufficient                   .15
  Controlled                       -.27*

N = 69.
*p < .05.
**p < .01.
***p < .001.


was .64. A cross-validity correlation coefficient of .51 was attained
with the same two variables. When the Social Conformity Scale
alone was cross-validated, a correlation of .56 was obtained. This
result seemed to indicate that in this population the most parsimonious
and efficient prediction of marijuana use could be
achieved using the Social Conformity Scale alone.

CPS profiles for the three criterion groups are shown in Figure 1. 

Discussion. The results strongly supported the previous findings 
of Comrey and Backer (1970) with respect to a substantial degree 
of association between marijuana use and certain of the Comrey 
Figure 1. Profiles based on samples of college students who report Non-use
of marijuana (...); Occasional use (----); and Heavy use of marijuana
( ) plotted against profile norms from the Comrey Personality Scales.


Personality Scales. Furthermore, a fairly reliable prediction con- 
cerning marijuana use could be made based on the single person- 
ality factor represented by the Social Conformity Scale. Thus, 
drug behavior can be scaled on a continuum of no use, occasional
use, and heavy use, and members of these classes can be ordered on
the CPS Social Conformity Scale. Mean scores for the three criterion
groups on this Scale were as follows: Non-use, 86.2; Occasional
Use, 72.5; and Heavy Use, 60.6.

Interview data obtained from many of the drug user subjects in 
the present samples over periods of several months to two years sug- 
gested that the patterns of scores for marijuana users depicted in- 
dividuals who were not actively rebelling against the society in 
which they had been reared, but, rather, these individuals seemed to 
have rejected the mores of that society, have become indifferent to 
its potential rewards, and have become involved in a subculture
with different values.

The ready identification of personality factors which may lead 
to drug involvement, beginning with marijuana experimentation 
and use, and a more nearly complete understanding of the drug 
user, would be beneficial to workers who deal with drug problems 
in counseling, education, and research. The social attitudes and 
values measured by Comrey’s Social Conformity Scale merit careful
consideration in future studies of drug abuse.

Summary. The Comrey Personality Scales (CPS) and the 16 Personality
Factor Questionnaire were used to predict marijuana use
in a college population. In a validation-cross-validation design,
correlation and regression analysis revealed that the CPS Social
Conformity Scale alone provided the most parsimonious and effective
prediction of marijuana use in the population. The result closely 
paralleled findings reported earlier on the relationship between 
Social Conformity and marijuana use. The predictive power of the 
scale and the implications for counseling and further research were 
discussed. 
FACTOR STRUCTURE OF THE RUNNER STUDIES OF
ATTITUDE PATTERNS


NORMAN M. CHANSKY, ROBERT COVERT, AND LORETTA WESTLER
Temple University


The Runner Studies of Attitude Patterns (RSAP) is a paper
and pencil personality test developed within a phenomenological
frame of reference. Its 14 scales measure four factors. Although 
trained users vouch for its predictive validity in industrial and 
educational settings, little is known about the psychometric 
qualities of the RSAP (Runner, 1970). Baggaley, Isard, and 
Sherwood (1970) found RSAP scores to differ in students en- 
rolled in different curricula. Studying its concurrent validity, 
Aberman and Chansky (1970) found subtests of the Cattell 16PF 
to share variance with the RSAP, although the two instruments 
had arisen out of different clinical traditions. The purpose of 
this study was to determine the reproducibility of the RSAP 
factor structure in college samples. 

The four dimensions or orientations of the RSAP hypothesized 
by Runner in his manual were CONTROL, FREEDOM, AFFIL- 
IATION and RECOGNITION. The CONTROL ORIENTED 
dimension includes four subtests: methods dependence, traditional 
righteousness, wariness of people and planful practicality; the 
FREEDOM ORIENTED factor, four subtests: intuitive intro- 
spectiveness, resistance to social pressure, pleasure in tool-skills, 
and active curiosity; the AFFILIATION ORIENTED dimension, 
four subtests: passive compliance, blamefulness, need for affectional
acceptance, and feelings of pressure; and the RECOGNITION
ORIENTATION, two subtests: competitiveness and dominant
directiveness.

Method. The 121-item RSAP was administered during Freshmen
orientation week to 435 female and 358 male freshmen enrolled
at Shippensburg State College. Four groups were formed by
randomly dividing the answer sheets. There were 172 males in
one group and 186 in another; there were 222 females in a third
group and 213 in the fourth.

Results and discussion. Sixteen measures were factor analyzed
using the minimum residual solution (Harman, 1967).
These were the 14 subtests of the Runner, the sum of the “trues” 
or the total score, and an empirically derived Social Desir- 
ability Scale of the RSAP (Westler and Chansky, 1971). 

Each of the four factor solutions revealed four interpretable 
factors, many of which are similar to those hypothesized by 
Runner. Table 1 presents the significant factor loadings in 


TABLE 1

Extracted Factors and Loadings above .40 for Male and Female Subpopulations

                                               Male            Female
                                           Group  Group     Group  Group
                                             A      B         A      B
FACTOR 1
  Total Score                               .85    .77       .72    .89
  A—Feelings of Pressure (Pr)               .68    .69       .60    .70
  A—Need for Affectional Acceptance (Af)    .63    .63       .48    .09
  F—Intuitive Introspectiveness (In)        [?]    .63       .72    .30
  F—Active Curiosity (Ac)                   .57    .54       [?]    .26
  A—Blamefulness (B)                        .51    .42       .47    .46
  A—Passive Compliance (Pc)                 .91    .44       .14    .87
  F—Pleasure in Tool-Skills (T)             .40    .43       .16    .21
FACTOR 2
  C—Methods Dependence (Md)                 .65    .70       .50    .72
  C—Traditional Righteousness (Rt)          .62    .69       .60    .47
  C—Wariness of People (Wa)                 [?]    .67       .63    .33
  C—Planful Practicality (Pl)               .53    .51       .58    .51
  Total Score                               .38    .50       .50    .20
FACTOR 3
  R—Dominant Directiveness (Do)             .72    [?]       [?]    [?]
  R—Competitiveness (Ox)                    .53    [?]       [?]    .41
  A—Passive Compliance (Pc)                -.45   -.27      -.27   -.46
  F—Active Curiosity (Ac)                   .32    .36       .38    .41
FACTOR 4
  Social Desirability                       .96    .95       [?]    [?]
  C—Methods Dependence (Md)                 [?]    [?]       [?]    .14

Legend—F—Freedom; A—Affiliation; C—Control; R—Competitiveness
Loadings below .40 are listed if a loading for the same variable in one of the
subpopulations exceeded .40.

1The authors gratefully acknowledge 
each of the sub-populations. The first factor identified was composed
of three main elements: the Total Score, with loadings
ranging from .72 to .89; all of the scales in the AFFILIATION 
dimension, ranging from .14 to .70; and three fourths of the 
scales in the FREEDOM dimension. In addition to the loadings 
of the Total Score, other high loadings were on the Feelings of 
Pressure, Need for Affectional Acceptance, and Intuitive In- 
trospectiveness. This factor suggested that there is a type of 
student who may be described as acquiescent, searching, contem- 
plative, and concerned about the opinions of others yet not 
trusting others. It might be viewed as the synthesis of the 
conflicting "inner" and "other" frames of reference. Could the 
confusion and agitation in the present college generation be due 
to its adoption of internally inconsistent orientations? 

The second factor extracted was clearly a replication of Runner's 
Control Orientation Dimension. It consisted largely of caution, 
dependence on established plans and procedures as well as a
suspicion of others, and of fundamentalist values. The tendency
to say YES was present but not to the degree found in Factor 
I. To some this constellation of behavior describes the author- 
itarian personality. 

The third factor replicated Runner’s Recognition Orientation 
but included also measures from other RSAP dimensions. This 
factor depicted the aggressive, assertive person who stands up 
for his own rights, is not easily “led,” and may impose his will 
on others. He is one who strives to rise to the top of the “pecking
order.”

The fourth factor extracted was Social Desirability. In large 
measure it reflected the tendency to say the socially correct 
things. Methods Dependence, the only RSAP Scale which loads 
on this factor, does so inconsistently. 

This study confirmed the existence of two factors reported by 
Runner, namely, Control Orientation and Recognition Orientation. 
A third factor found here, which was composed of two Runner 
dimensions, might be descriptive of a new adolescent, one who had 
adopted both inner and other directed values. Finally, the fourth 
factor, Social Desirability, pertained to the tendency to say the 
socially desirable things. Of particular importance to the internal 
validity of the Runner scales was the fact that the Social Desirability

TABLE 2
Diagonal Elements in Matrices of Similarity Coefficients
for Factor Structures of the Subpopulations

                       Factor 1  Factor 2  Factor 3  Factor 4
Male A—Male B            .99       .99       .99       .99
Female A—Male B          .97       .98       .95       .98
Female A—Male A          .97       .98       .94       .99
Female B—Male B          .92       .99       .97       .91
Female B—Male A          .93       .98       .95       .90
Female A—Female B        .87       [?]       .89       .99

scale did not load on any other factor. This finding
signified that unlike many paper and pencil personality tests, the 
tendency to endorse items by saying “True” is not a function of 
their social desirability. 

The present study also indicated that the factors are reproduc- 
ible within the population. Table 2 presents the similarity coef- 
ficients comparing the factor structures of the four groups; male 
A and B and female A and B. In general, the factor loadings 
which appeared in one group were highly correlated with those 
in the other three groups. Specifically, the diagonal elements of 
the male criterion and cross validation groups most closely ap- 
proximated the diagonals of an identity matrix. The groups show- 
ing the least similarity were the two female groups. Nevertheless, 
the factor structure of this group might still be regarded as highly 
similar. Thus, the RSAP factors were reproduced not only in the
criterion group but also in the cross validation group.
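
A similarity coefficient between two factor-loading matrices can be sketched as follows. The report does not say which coefficient was used, so this example assumes Tucker's congruence coefficient, one common choice; the function name `congruence` and the small loading matrices are hypothetical and are not the study's data.

```python
import numpy as np

def congruence(loadings_a, loadings_b):
    """Tucker congruence coefficients between every pair of factors
    (columns) in two loading matrices with the same variables (rows)."""
    a = np.asarray(loadings_a, dtype=float)
    b = np.asarray(loadings_b, dtype=float)
    cross_products = a.T @ b
    norms = np.sqrt(np.outer((a ** 2).sum(axis=0), (b ** 2).sum(axis=0)))
    return cross_products / norms

# Hypothetical loadings: four variables (rows) on two factors (columns).
group_a = np.array([[.85, .38], [.68, .10], [.10, .65], [.05, .62]])
group_b = np.array([[.77, .50], [.69, .05], [.12, .70], [.08, .69]])

phi = congruence(group_a, group_b)
print(np.round(np.diag(phi), 2))  # diagonal: matched-factor similarity
```

Under this assumption, the diagonal of `phi` corresponds to the kind of entries tabulated for each pair of groups, and a matrix close to the identity indicates that matched factors are similar while unmatched factors are not.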

Summary. The Runner Studies of Attitude Patterns was ad- 
ministered to 793 college freshmen. Factor analyses were conducted 
on two male and two female groups from this population. Four 
reproducible factors were found: the first was inner-other directed 
made up of Runner's Freedom Orientation and Affiliation Orientation;
the second, Control Orientation; the third, Recognition
Orientation. A fourth factor, Social Desirability, was composed
primarily of an empirically derived independent scale. To a large 
degree Runner's hypothesized scales were supported. 
MMPI O-H SCALE RESPONSES OF ASSAULTIVE
AND NONASSAULTIVE PRISONERS AND
ASSOCIATED LIFE HISTORY VARIABLES¹


CHARLES H. MALLORY AND C. EUGENE WALKER
Baylor University


Although efforts to distinguish assaultive persons by means 
of experimental scales of the MMPI have been made by a number 
of researchers, Megargee and Mendelsohn (1962) found that 12 
such indices revealed a very low degree of discrimination, if 
any, between samples of so-called normal subjects and groups of 
extremely assaultive, moderately assaultive, and nonviolent pris- 
oners. Megargee, Cook, and Mendelsohn (1967) developed the 
O-H (Overcontrolled-Hostility) scale that distinguished these
groups on the basis of 31 MMPI items. They observed that the
scale appeared to indicate a certain type of assaultive offender
who not only kept tight controls over his extremely hostile impulses,
but also displayed a pattern of conformity until strong
provocation led to a breakdown of impulse control and to an 
extremely violent attack upon another. 

Purpose. The major purpose of the present study was to test 
the adequacy of the O-H scale to differentiate groups of prisoners 
determined, on the basis of two explicitly defined criteria, to be 
overcontrolled assaultives, undercontrolled assaultives, or nonviolent.
The two criteria employed were the type of offense committed
(assaultive or nonassaultive) and the number of prior
offenses (no prior convictions vs. one or more prior convictions).
These criteria were considered to be a more stringent test of
the applicability of the O-H scale in prison populations than that


¹ This article is based on the senior author's Master's thesis completed at
Baylor University, 1972.
conducted by Megargee et al. (1967). Although assaultive prisoners 
had been classified as overcontrolled or undercontrolled on the 
basis of a number of variables in their biographical and offense 
records, no explicit criteria were stated for decisions concerning 
assignment of subjects to the groups. Such apparently subjective 
selection techniques could possibly have biased the results of their 
study and certainly made replication of the study problematic. 

A secondary purpose of the present study was to determine 
whether O-H scores were related to selected life history variables. 

Method. The subjects for this study were 88 male inmates of 
the Texas Department of Corrections who were age 30 or over 
at the time of commission of the offense for which they were 
currently sentenced. All had IQ's of 75 or higher. Using the
criteria noted previously, subjects were assigned to the following
groups (N = 22 for each group): Group I, men who had committed
violent murder or assault and had no previous felony
convictions; Group II, men who had committed violent murder
or assault and had one or more previous convictions for offenses
of this category; Group III, men who had committed nonviolent
crimes of forgery or theft and had no previous felony convictions;
and Group IV, men who had committed forgery or theft and had
one or more previous convictions for offenses of this category.
Subjects were administered an inventory consisting of randomly
ordered items of the O-H, F, K, and L scales of the MMPI.

Results and discussion. The table presents the frequency distributions,
means, and standard deviations for the four prisoner
groups. An analysis of variance indicated no significant differences
in O-H scores obtained by the four prisoner groups.
As expected, the mean of the assaultive first offender group was
higher than the means of the other groups, but the failure of any
one of these differences to achieve statistical significance indicated
that the O-H scale could not discriminate adequately between
this group most likely to exhibit hostile overcontrol and the remaining
prisoner groups. Since the relatively high means of the
nonassaultive groups on the O-H scale suggested high hostility
and strong controls in men who had no records of any personal
attacks, the validity of the scale was judged to be in some doubt
with this prisoner population. 
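
Because the four groups are of equal size, the analysis of variance can be approximately reproduced from the group means and standard deviations given in the table. The Python sketch below is an illustration only; it assumes the tabled standard deviations are sample values computed with n - 1 in the denominator, and the function name `f_from_summary` is hypothetical.

```python
def f_from_summary(means, sds, n):
    """One-way ANOVA F ratio for k equal-sized groups, computed from
    group means, sample standard deviations, and the common group size n."""
    k = len(means)
    grand_mean = sum(means) / k
    # Between-groups mean square: spread of group means around the grand mean.
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    # Within-groups mean square: average of the group variances (equal n).
    ms_within = sum(s ** 2 for s in sds) / k
    return ms_between / ms_within, k - 1, k * (n - 1)

# Means and standard deviations for Groups I-IV as given in the table.
f_ratio, df_between, df_within = f_from_summary(
    means=[15.68, 14.95, 14.59, 15.04],
    sds=[2.60, 4.16, 3.08, 3.53],
    n=22,
)
print(round(f_ratio, 2), df_between, df_within)
```

With the rounded table values this gives an F of roughly 0.39 on 3 and 84 degrees of freedom, in line with the nonsignificant result reported: the differences among the group means are small relative to the variability within groups.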

Further analysis revealed that O-H scores were not significantly 
TABLE 1

O-H Score Frequency Distributions of the Prisoner Sample

Prisoner Groups(a)
Assaultive Nonassaultive
Raw Male T I II III IV
Scores Scores First Offenders Recidivists First Offenders Recidivists
23 88 1
22 84 0
21 81 1 2
20 77 0 0 1 2
19 73 2 1 2 5
18 70 2 0 0 4
17 66 4 2 2 0
16 63 3 4 5 2
15 59 2 3 3 3
14 56 4 1 1 1
13 52 2 3 2 2
12 50 1 2 1 3
11 45 0 1 4 1
10 42 1 1 0 0
9 38 0 0 1
8 35 0 1 1
7 31 0
6 28 0
5 24 0
4 20 1
N 22 22 22 22
Mean 15.68 14.95 14.59 15.04
SD 2.60 4.16 3.08 3.53

(a) F = .40, df = 3.87, p > .05.


correlated with age, IQ, educational achievement, or occupational 
level. In addition, chi-square tests of association did not indicate 
significant relationships between O-H scores and the biographical 
variables of race, marital status, religion, occupation type, or 
military discharge status. 

It should be noted that Megargee, et al. (1967) suspected that 
the amount of overlap between O-H score distributions of their 
overcontrolled and undercontrolled assaultive groups was due to 
inaccurate prison records which may have resulted in the assign- 
ment of previously assaultive men to the overcontrolled group. The 
same qualification of selection procedures must be made in the 
present study.

The lack of statistically significant results when two explicitly 
defined criteria were employed and the lack of association between
O-H scores and any of the life history variables suggest that
considerably more research is needed before the O-H scale will be
fully understood or useful in applied situations. 
THE RELATIONSHIP OF MEASURES OF CONFORMITY
AND SOCIALIZATION TO GENERAL INTELLIGENCE
FOR A SAMPLE OF GIRLS IN A PRIVATE SENIOR
HIGH SCHOOL LOCATED
IN A CONSERVATIVE COMMUNITY


ALEXANDER S. HOLUB AND JOAN J. MICHAEL
California State University, Long Beach 


For a sample of 121 senior high school girls in a private 
religiously-oriented school in Orange County, California, a con- 
servative middle-class community, the purpose of the investiga- 
tion was to ascertain the relationship of general intelligence 
as measured by standard scores on the California Short-Form Test 
of Mental Maturity (CTMM), Level 4, to two personality constructs
as operationally described by subtests of the California
Psychological Inventory (CPI): Achievement via Conformance
(AC) and Socialization (SO). It was thought that if a moderate
relationship could possibly be demonstrated between intelligence
and either (a) a socialization or (b) an achievement-conformity
variable, one of the most likely possibilities for its occurrence
would be in a sample similar to the one studied.

Findings. In Table 1, the intercorrelations among all three 
variables are cited along with their means and standard deviations.
Although the correlation of .70 between the AC and SO
measures was substantial, the AC variable was not significantly
correlated with the intelligence measure. Although statistically
significant at the .01 level, the correlation coefficient of .24 between
the SO variable and the CTMM measure accounted for less than
.06 of the variance in this intelligence measure.
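
The "less than .06" figure follows from squaring the correlation, since the proportion of variance two measures share is the square of their correlation coefficient. As a worked check (the subscripts are labels introduced here for illustration only):

```latex
\begin{align*}
r_{\mathrm{SO,\,CTMM}}^{2} &= (.24)^{2} = .0576 < .06, \\
r_{\mathrm{AC,\,SO}}^{2}   &= (.70)^{2} = .49 .
\end{align*}
```

The two personality scales thus share about half of their variance with each other, whereas Socialization shares under 6 percent of its variance with the intelligence measure.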

Conclusion. Thus it was concluded that for the measures em- 
ployed little if any relationship existed between general intelli- 
TABLE 1

Intercorrelations of Two Scales of the California Psychological Inventory (CPI)
and Two Subtests of the California Short-Form Test of Mental Maturity
(CTMM) Along with Means and Standard Deviations for a Sample
of Female Students in a Private Senior High School
in Orange County (N = 121)*

Variables                                     (1)    (2)    (3)    Mean     SD
1. CPI—Achievement via Conformance (AC)        -     .70    .18    21.9    5.4
2. CPI—Socialization (SO)                             -     .24    33.9    7.6
3. CTMM—Standard Score (SS)                                  -    107.8   13.3

* Correlation coefficients of .18 and .23 are required for significance at the .05 and .01 levels, respectively.


gence and either one of the two personality constructs studied 
even in a population in which such a relationship would seem 
to be most probable in terms of sociological expectations. 
AN INSTRUMENT TO MEASURE VOCATIONAL
MATURITY¹


BERT W. WESTBROOK AND JOSEPH W. PARRY-HILL, JR.
North Carolina State University 
ROGER W. WOODBURY 
North Carolina Department of Youth Development 


VOCATIONAL maturity has come into fairly wide use as a factor 
of some significance in the vocational adjustment of youth. Re- 
search to date testifies to the importance of the concept, but its 
use has been restricted by the lack of objective, reliable, and 
valid instruments for measuring it. This report describes the 
Cognitive Vocational Maturity Test (CVMT), an instrument de- 
signed to measure career knowledges and abilities within six areas 
of the cognitive domain of vocational maturity, and offers
validity and reliability data. 

Development of instrument. The following six areas of vocational
maturity were identified through a review of the literature: (1)
Fields of Work—knowledge of which occupations are available
in various fields of work. (2) Job Selection—the ability to choose
the most realistic occupation for a hypothetical student who is
described in terms of his abilities, interests, and values. (3)
Work Conditions—knowledge of work schedules, income level of
jobs, physical conditions of jobs, and job locations. (4) Education
Required—knowledge of the amount of education generally
required for a wide range of occupations. (5) Attributes Required
—knowledge of the abilities, interests, and values generally required
for various occupations. (6) Duties—knowledge of the principal
duties performed in a wide range of occupations.

¹ For further details, the reader is referred to: Westbrook, Bert W. and
[...] State University at Raleigh, in press.

Specifications for each of the six areas were designed to insure 
a representative coverage of occupations in eight interest fields: 
service, business contact, organization, technology, outdoor, science,
general culture, and arts and entertainment. Forty-eight occupa- 
tions were selected for each of the first five areas by sampling 
six occupations from each of the eight interest fields. The sixth 
area, Duties, contained a total of 72 occupations, nine from 
each of the eight interest fields. Multiple-choice items were con- 
structed for each of the 312 selected occupations. 

The 312 items which comprised the item pool were reviewed 
from three points of view: technical, subject matter, and editorial. 
The 288 items which survived the review process constituted 
the item-analysis research form which was administered to pupils 
in grades six (N = 991), seven (N = 2124), and eight (N =
2044). Items with discrimination indices (item-subtest correla- 
tions) below .30 were eliminated, leaving a total of 120 items 
for the final form of the CVMT. The items are distributed as 
follows: Fields of Work, 20 items; Job Selection, 15 items; Work 
Conditions, 20 items; Education Required, 20 items; Attributes 
Required, 20 items; and Duties, 25 items. 
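
The item counts reported above can be checked with a few lines of arithmetic. The sketch below rebuilds the 312-occupation pool from the sampling plan and totals the final form; the variable names are illustrative only, and every number is taken from the text.

```python
# Occupations sampled per interest field for each of the six CVMT areas.
INTEREST_FIELDS = 8

occupations_per_field = {
    "Fields of Work": 6,
    "Job Selection": 6,
    "Work Conditions": 6,
    "Education Required": 6,
    "Attributes Required": 6,
    "Duties": 9,
}

# One multiple-choice item was written for each selected occupation.
pool_by_area = {
    area: per_field * INTEREST_FIELDS
    for area, per_field in occupations_per_field.items()
}
pool_total = sum(pool_by_area.values())

# Item distribution of the final form, as listed in the text.
final_form = {
    "Fields of Work": 20,
    "Job Selection": 15,
    "Work Conditions": 20,
    "Education Required": 20,
    "Attributes Required": 20,
    "Duties": 25,
}
final_total = sum(final_form.values())

# Share of the 288 reviewed items that survived item analysis.
retained_share = final_total / 288

print(pool_by_area["Duties"], pool_total, final_total, round(retained_share, 2))
```

Five areas of 48 occupations plus 72 for Duties give the 312-item pool, and the six subtest lengths sum to the 120 items of the final form.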

Reliability and validity data. In May, 1970, the final form of 
the CVMT was administered to a total standardization sample 
of 7,367 North Carolina public school pupils who were enrolled
in a statewide career exploration program for grades six
(N = 1398), seven (N = 2384), eight (N = 2659), and nine
(N = 926). 

Kuder-Richardson reliability coefficients were determined for 
each grade separately on each of the six area subtests. They 
ranged from a low of .67 for Job Selection in grades six and 
nine to a high of .91 for Duties in grade 8. Only six of the 24 
coefficients are below .80. 
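
The article names the Kuder-Richardson approach without showing the computation. The sketch below assumes Formula 20 applied to dichotomously scored items; the tiny response matrix is hypothetical and is not CVMT data.

```python
from typing import List


def kr20(responses: List[List[int]]) -> float:
    """Kuder-Richardson Formula 20 for 0/1 item scores.

    responses[i][j] is 1 if examinee i answered item j correctly, else 0.
    """
    n_people = len(responses)
    n_items = len(responses[0])

    # Item difficulties (proportion correct) and the sum of p * q.
    p = [sum(row[j] for row in responses) / n_people for j in range(n_items)]
    sum_pq = sum(pj * (1.0 - pj) for pj in p)

    # Variance of total scores; population form, to match the p * q terms.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people

    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical data: five examinees, four items.
demo = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

print(round(kr20(demo), 2))
```

Applied separately to each subtest within each grade, a routine of this kind would yield the 24 coefficients summarized in the paragraph above.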

Criterion-related validity data, based upon a sample of 249 
ninth-grade students, revealed that pupils whose vocational choice 
was in their field of interest and at the appropriate aptitude level 
attained significantly higher mean scores on all CVMT sub- 
tests than pupils whose vocational choice was neither in their 
field of interest nor at the appropriate aptitude level. 
Correlations between the CVMT area subtests and the California
Test of Mental Maturity for the sample of 249 ninth-grade
students ranged from .53 for Work Conditions to .69 for Duties. 

Conclusion. The CVMT appears to be sufficiently reliable and 
valid for use in evaluating career education programs having 
objectives which match the six areas included in the CVMT.

N. L. Gage. Teacher Effectiveness and Teacher Education: The
Search for a Scientific Basis. Palo Alto, Calif.: Pacific Books, 
1971. Pp. 226. $7.95. 


N. L. Gage has spent a good portion of his professional life
investigating and theorizing about teaching, having weathered well
the storms of pessimism surrounding this area of research. Gage’s
belief that science can contribute to the art of teaching is the raison
d'être of this volume and is reflected throughout its pages. Whenever
the case for a scientific basis for teaching cannot be made
purely in terms of empirical results, Gage restates his credo and 
proceeds to demonstrate how things might be if other models, other 
approaches, and more investigative energies were applied to the 
task. Sometimes Gage appears to be operating on pure, “never say 
die” faith, but in certain areas the research findings have been sub- 
stantial enough to contribute to teaching. 

The book is divided into three sections, consisting of an introductory
chapter and two larger parts—“Research on Teacher
Effectiveness” and “Research on Teacher Education”—composed
of 7 and 5 chapters, respectively. The individual chapters are 
redrafts of papers which Gage has written over the past decade 
or so, all but one of which were published originally before 1968.
Consequently, the content is somewhat dated for such a rapidly
changing area of human enterprise. 

The first chapter, an overview of the book as a whole, summarizes
the areas of research on teacher education and teacher effects.
Central to both of these research areas are studies of teacher be- 
haviors and characteristics, which have implications for teacher 
education procedures and student learning. Alexander Pope, like 
many famous writers, derived his own prescriptive maxims for 
teaching, two of which were “Things unknown should be treated 
as things forgotten,” and “Speak with seeming diffidence though 
sure.” In contrast to this kind of experientially-derived wisdom, 
Gage maintains in Chapter 2 that more reliable principles of teaching
can be gleaned from scientific investigation. Not only does he
question the pessimism of Brim, Stephens, Coleman,¹ and other
contemporary writers concerning scientific contributions to teaching,
but he also demonstrates by “sifting” the literature that, for
example, several teacher characteristics (warmth, indirectness,
cognitive organization, enthusiasm) are consistently related to
student learning.

¹ According to the report on Equality of Educational Opportunity
(Coleman et al., 1966), when the social backgrounds and attitudes of
pupils are statistically controlled, the proportion of variance in
scholastic achievement accounted for by differences among school
characteristics is rather small relative to that accounted for by
differences in family [...]. This conclusion, however, has been [...]
statistical grounds. Furthermore, Gage concurs with Bowles, Levin,
Mood, and others that the differential effects of home, school, and
peers on student achievement cannot be adequately evaluated by
current statistical methods.

Obviously, teachers influence students, but what kinds of influences
are these and how can they be controlled? Three types of
behavioral influences—conditioning, modeling, and cognitive
restructuring—are discussed in Chapter 3. Admittedly, current
explanations of how these influences operate are inadequate.
Therefore, Gage calls for theories of teaching that will replace the
conceptions described in this chapter. In Chapter 4, the differences
between theories of teaching and theories of learning, with the
former possibly subordinate to the latter, are discussed. Various
analyses of the teaching process—according to types of teacher
activities, types of educational objectives, and as components
corresponding to those of the learning process or types derived from
families of learning theory—are considered. Applications of these
analyses to teaching activities such as explaining, mental hygiene,
and demonstration are also presented. But Gage’s statement that
the multiple conceptions now required to explain learning and
teaching may ultimately be reducible to conditioning principles
sounds wistfully like unreconstructed Hullianism.

Chapters 5 and 6 deal with substantive, methodological, and
logistical paradigms of teaching. Gage’s tenet that theories and
paradigms are needed to make research more orderly and systematic
is revealed especially in these two chapters. The alternatives
offered to the time-honored “criteria of effectiveness” paradigm are
“teaching process paradigms” (e.g., teaching as information
processing, teaching as interaction) and “machine paradigms.”
Unfortunately, these alternative paradigms are merely described,
with no attention being given to the implications or possible
outcomes of using them. Also, the teaching-process and machine
paradigms must surely be evaluated in some way, but here again
criteria of effectiveness are involved.

The last two papers in Part I concentrate on microteaching and
programmed instruction. Reminiscent of post-Hullian theorists
with their miniature models of learning, Gage believes in analyzing
teaching into basic components rather than making the mistake
of the earlier educators who attempted to tackle the process as a
whole. In fact, since there is so much specificity and situationalism
in effective teaching, it is recommended that we stop trying to 
train students to emulate “master teachers,” but rather set up in- 
structional goals that are attainable by well-trained teachers in 
concert with programmed instructional devices. Gage makes a 
strong case for programmed instruction throughout the book, cit- 
ing as its advantages the handling of complex tasks involved in 
realizing cognitive objectives and the individualization of instruc- 
tion. Teachers who make optimum use of programming can then 
concentrate more on the socioemotional needs of students. The vir- 
tues of programmed instruction, as extolled by Gage in Chapter 8, 
include several characteristics that have not stood up well under 
research scrutiny. And, as Gage himself would admit, programmed 
instruction is not equally effective with all learners. In fact, teach- 
ing machines and computer-assisted instruction may pose so many 
logistical and other problems as to make these techniques remain 
adjuncts rather than substitutes for human teaching. 

Part II of the volume reviews some of the more specific findings 
and presents numerous suggestions for further research on teacher 
education. As with any collection of papers, there is a lack of con- 
tinuity between adjoining chapters in many instances, and quite a 
bit of “jumping” from one topic to another both within and between
chapters. Part II can be read independently from Part I,
and will prove especially useful to graduate students in education
who are searching for a dissertation topic. Unfortunately, Part II,
and the book as a whole, is not really a comprehensive tour de
force containing a master design for research on teaching.
Insufficient attention is given to the interaction of teaching method with
age, sex, and other individual difference variables. For example, 
are different personality characteristics and instructional skills re- 
quired of secondary teachers than of primary teachers? Also, Gage 
gives short shrift to the development and application of what is
already known—either from research or “the conventional wisdom”
—about good teaching. I suspect that someone could argue rather 
convincingly that we already know quite a bit about how children 
should be taught; the rub comes in cultivating the proper attitudes 
and teaching skills in prospective teachers of average ability who
somehow manage to ignore or forget a large portion of what they 
are exposed to in teacher training institutions. 

To summarize briefly the remainder of the book, Chapter 9 be- 
gins with a review of the studies of Conant and Flexner and pro- 
ceeds to debunk the notion that educational research findings are 
merely obvious common sense. Gage then considers the status and 
prospects of research in teacher education under the categories of 
substantive results, methodology, and logistics, and whether the 
information and techniques falling in these three categories have
already been applied, are waiting to be applied, or still need to be
researched. Chapter 10 is an interesting synopsis of the contribu- 
tions that educational psychology can make to teacher education, 
with a focus on teaching the culturally disadvantaged and in the 
great cities. Chapter 11 deals with various approaches to evalu- 
ating teachers for administrative purposes (by means of student 
ratings, student achievement, and observations) and for self- 
improvement. Although the original paper from which this chapter 
was taken was written over a decade ago—before the current ac- 
countability flareup in education, the rise of teacher evaluation as 
a criterion in research is referred to. In Chapter 12 Gage concludes 
from an analysis of various experimental results that feedback 
of student ratings of teachers to the latter improves teacher behav- 
ior. Finally, Chapter 13 deals with the changing roles of the 
teacher, particularly in regard to programmed instruction, and 
other tools of the trade. The last section of this chapter summa- 
rizes Gage's position on teacher training—away from predictions 
of overall effectiveness to improvement of specific skills through a 
careful analysis of specific teacher behaviors and instructional acts 
and their resultant specific effects on students. To Gage, a very 
important part of the new look in teaching involves packaged pro- 
grams and other devices that can be used consistently by differ- 
ent people in different situations. 

Clearly, neither N. L. Gage nor anyone else can foresee precisely
what teaching will be like in the future. Certainly, the newer
models and paradigms that Gage describes should be explored,
but the explorer would do well not to expect revolutionary
consequences. Microanalysis of the teaching process may lead to
improvements in specific skills, but, as with factor-analyzed tests
compared to general intelligence tests, may not lead to more
effective realization of a composite outcome condition. Also, teaching
machines and programmed instruction, which some would say
have crested already, are useful in specific situations. But the
logistical problems and potential sociopolitical effects of more
machine instruction should be viewed with caution. Finally, during a
time when people are reexamining the goals of public education,
e.g., whether a child’s education should be left entirely up to
[...], we should not be content to concentrate exclusively on
methods and techniques of attaining traditional cognitive, affective,
or psychomotor objectives. Ends as well as means are important.


John P. Hubbard (with a chapter by Charles F. Schumacher).
Measuring Medical Education, the Tests and Test Procedures 
of the National Board of Medical Examiners. Philadelphia: 
Lea and Febiger, 1971. Pp. xii + 180. $8.50. 


The charge to the National Board of Medical Examiners is to 
produce a reliable and valid examination procedure for use in li- 
censing physicians. The Board has accepted the challenging prob- 
lem and has made a strong effort to overcome many testing evils. 
They have changed from an unreliable essay format to one of 
multiple-choice for Parts I and II of the examination. Part I 
covers the six basic sciences such as anatomy, physiology, etc. 
Part II includes six clinical sciences such as surgery, pediatrics, 
etc. Part III has been changed from unreliable oral examinations 
of clinical patient problems to multiple-choice and patient man- 
agement simulations. Part III is intended as a measure of clinical 
competence. 

For the reader who does not wish to deal with the technicalities 
of test analysis, Hubbard’s book is an informational account of 
construction of multiple-choice examinations. Part III, which deals 
with the assessment of clinical competence, is especially helpful. 
A lack of control of examiners and patients caused discontinuation 
of oral patient management examinations. A three-year study of
independent evaluation of single candidates by two examiners
yielded a correlation coefficient of .25 for a total of 10,000
examinations. To establish “what to test,” 3300 critical incidents were
collected, analyzed, and categorized.

As a result, two different kinds of testing were adopted. The
first is a paper and pencil simulation of patient management
problems. In this type of testing, a simulation of a visit with a
patient is presented. As more information is required, the examinee
chooses from a list of options; these may include relevant and
irrelevant information about a physical examination, diagnostic
tests, or laboratory tests. When a student believes he has enough
information, he selects a diagnostic option followed by a treatment
option from comprehensive lists. At each step only the information
or result of a selected option is revealed.

In essence, the student is scored on the basis of his [...]
right decisions as compared to the decisions of a group of experts.
Initially he is given a “handicap score.” For every error made, a
point is subtracted; for every right decision, a point is added. The
reliability of the programmed patient management problem is
generally .80 to .85. There are two categories of failures on patient
management problems. They are the “shotgunners” and the “timid 
souls”; those who overdo and underdo the information gathering 
and decision making. The way reliability of this unusual test was 
established was not explained. It seems the reliability issue ought 
to be further clarified. 
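
The scoring rule described above (a starting handicap, plus one point per right decision, minus one per error) can be sketched as follows. The review does not say what counts as an error, so this sketch assumes two kinds: choosing an option the expert panel marked as contraindicated, and omitting one it marked as indicated. The option names and the handicap of 10 are hypothetical.

```python
from typing import Iterable, Set


def score_patient_management(
    chosen: Iterable[str],
    indicated: Set[str],
    contraindicated: Set[str],
    handicap: int = 10,  # hypothetical starting value
) -> int:
    """Handicap score adjusted by right decisions and errors (assumed rule)."""
    picked = set(chosen)
    right = len(picked & indicated)          # agreed with the expert panel
    harmful = len(picked & contraindicated)  # error of commission
    omitted = len(indicated - picked)        # error of omission
    return handicap + right - harmful - omitted


# Hypothetical expert key for one simulated patient.
indicated = {"take_history", "chest_xray", "blood_count"}
contraindicated = {"biopsy", "exploratory_surgery"}

expert_like = score_patient_management(indicated, indicated, contraindicated)
shotgunner = score_patient_management(
    indicated | contraindicated, indicated, contraindicated
)
timid_soul = score_patient_management({"take_history"}, indicated, contraindicated)

print(expert_like, shotgunner, timid_soul)
```

Under these assumptions both failure patterns lose points: the "shotgunner" through errors of commission and the "timid soul" through errors of omission.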

The other section of Part III consists of multiple-choice items
based on interpreting pictures, tables, charts, X-rays, etc. The
intercorrelations of Parts II and III range from .34 to .48. This is
desirable, indicating the two parts are measuring different
abilities.

For a reader wishing to know how the results of examinations
are reported to medical schools, Chapter 7 is quite helpful.
Essentially, the results are given on each part, with the performance of
each of its subsections reported in terms of Average Standard 
Score and the Standard Score Deviation. The report is tabulated 
by school for curriculum guidance purposes. 

The National Board is to be commended for the job done to
date; however, owing to technical errors, its charge is far from being
finished. Apparently the Board has produced a content valid
examination and one which has high scoring reliability. However, it
has not produced a test which maximizes validity for selection
purposes. The purpose of selection examinations (which the
National Board Examinations are) is to provide information for
rejection or acceptance of candidates. However, the National Board
Examinations do not appear to be constructed according to
selection testing criteria.

Richardson (1936) has delineated appropriate criteria for selection
tests: “It is definitely established by these experiments that
tests of different difficulty will predict to a two-categoried
criterion with different degrees of effectiveness. If it is desired to
separate off a minor proportion (of individuals, my comment) from
the lower end of the distribution of criterion scores, then an easy
[test has] much greater validity than have more difficult tests” (p. [...]).

The National Board has used the Kuder-Richardson Formula
20 (KR 20) for establishing a reliability estimate. This is a
procedural error. The KR 20 is maximized when there is a normal
distribution of true scores, but this is inconsistent with maximizing
the validity of a selection test. Cronbach and Warrington
(1952), building on Richardson’s work, state: “When a
multiple-choice test is intended to reject the poorest F per cent of the men
tested, items should on the average be located at or above the
[...] for men whose true ability is at the Fth percentile” (p. [...]).

A normal distribution of true scores cannot be established on
this basis. Since KR 20 is maximized for a normal distribution and 
not for other distributions, other methods for establishing reliabil- 
ity should be sought. Perhaps the reliability coefficient suggested 
by Livingston (1971) would be more appropriate. Other approaches 
might include use of the multiple discriminant function or a
multiple-R for dichotomous data.

As in other test construction endeavors, the National Board has
problems establishing validity. The ultimate criterion for valida- 
tion purposes is the performance of practicing physicians. Since 
these data are presently unattainable, three intermediate criteria 
are used. One is content validity, which is established by the proc- 
ess used for selecting items for examinations. Another is medical 
school grades. A third is seeing how students with differing degrees 
of educational background perform; better educated students 
ought to achieve higher scores. They do. 

Another difficult problem for the National Board of Examiners 
lies in establishing cutoff scores. As Cronbach (1970) points out, 
“Setting a cutting score requires a value judgement (p. 423).” The 
Board has established cutoff scores for Parts I and II, and pre- 
sumably for Part III, although this was not made clear. It is ap- 
parent that cutoff scores are set on each of the parts, but not on 
the component sections that make up the part. The decision to
establish cutoff scores at that level rests more on an intuitive
than a scientific basis.

The Board feels that it is reasonable to allow students to com- 
pensate for weakness in one section by doing better in another. A 
justification from page 63 is: “While the MPL (minimum pass 
level) is emotionally appealing, one must question whether exam- 
iners can make such predictions with a high degree of reliability 
and validity, and whether the resulting standards can have the 
stability required for an instrument that is to be used for the pur- 
pose of certification.” 

A counter argument could be advanced that students should
meet an MPL on each section if a specific discipline represented by
a section contributes unique and necessary information to a
physician’s competency. Therefore, the reasoning used in making
cutoff scores for the parts ought to be applied to establishing
cutoff scores for the sections. How the reliability of the parts cutoff
score is any better than the reliability of the section cutoff scores
is not explained. It is clear that a KR 20 approach is not
appropriate in either case.

The validity (questioned in the above quote), since it is content
based, would not necessarily be affected by the use of an MPL but,
as suggested by Richardson, by the difficulty level chosen. It
seems the cutoff score ought to be based on the examiners’ judge- 
ments of competency and not on how high or low the failure rate 
should be. Incompetents should be failed; competents should be 
passed. The method used still avoids the issue. After all, the per- 
formance of examinees scoring one standard error of the measure- 
ment above or below the cutoff score is still in doubt. Individuals 
in this area of doubt should be tested further. 
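
The "area of doubt" around a cutoff can be made concrete with the standard error of measurement, SEM = SD × √(1 − reliability). The sketch below flags examinees who fall within one SEM of the cutoff for further testing. The standard deviation, reliability, and cutoff are hypothetical values chosen for illustration and are not National Board figures.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: score SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def classify(score: float, cutoff: float, sem: float) -> str:
    """Pass, fail, or flag for further testing when within one SEM of the cutoff."""
    if abs(score - cutoff) <= sem:
        return "retest"
    return "pass" if score > cutoff else "fail"


# Hypothetical scale values for illustration only.
SCORE_SD = 10.0
RELIABILITY = 0.84
CUTOFF = 75.0

sem = standard_error_of_measurement(SCORE_SD, RELIABILITY)

for score in (60.0, 73.0, 82.0):
    print(score, classify(score, CUTOFF, sem))
```

With these values the SEM is about 4 points, so scores from roughly 71 to 79 would be referred for further testing rather than passed or failed outright.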

As stated before, the National Board is to be commended for the 
job it has done to date. It is hoped that the other health professions 
view their certification programs as diligently as the National 
Board of Medical Examiners has.


Edwin I. Megargee. The California Psychological Inventory Handbook.
San Francisco, Calif.: Jossey-Bass, 1972. Pp. xxvi + 298.


Although the California Psychological Inventory (CPI) represents
perhaps the most important advance in personality measurement
[...] detailed information concerning the construction, refinement, and
validation of each scale, along with reliability estimates,
evaluative summaries of the validational research, and brief descriptions
of some new CPI scales. Counselors and clinicians will probably 
be most interested, however, in the third section wherein Megargee 
discusses test interpretation. After summarizing and integrating 
the results of 20 factor and cluster analyses of the CPI, he presents 
adjectival correlates of each scale derived from peer ratings, and 
concludes with a chapter on the interpretation of individual pro- 
files. Finally, section four of the book contains a review of the 
CPI as it has been used in various research settings, emphasizing, 
in particular, interesting scale interactions and solutions to various 
multivariate prediction problems. 

Several features of this handbook are noteworthy. First, it is not 
an “in-house” publication of the Institute of Personality Assess- 
ment and Research at Berkeley where the CPI was developed; it 
is definitely Megargee’s book, a fact which is seen, for example, in 
the way he consistently distinguishes between his and Gough’s 
attitudes toward the test. Second, the book is clearly written and 
easy to read. The clarity is perhaps most obvious in the discussion 
of such potentially obscure psychometric issues as strategies for 
item selection and criticisms of step-wise regression techniques. 
These and several other technical topics are handled with
admirable lucidity and dispatch.

Another valuable feature of the book is that it contains a good
deal of important information that is literally unavailable
elsewhere. Most people who know Gough well feel that his best ideas
appear primarily in letters and private conversations. As Megargee
notes, “Gough is one of the few test authors who has formally
articulated his values and his philosophy of testing; unfortunately,
the most cogent expression of his principles are contained in
unpublished papers and personal correspondence” (p. 10). Megargee
presents these principles in detail and, as a result, demystifies such
otherwise puzzling features of the CPI as, for example, the
arrangement of the scales on the profile sheet and the “value-loaded”
names of many of the scales.

A fourth commendable feature of the book is its superb review
of the CPI literature. A wealth of data is summarized, integrated,
and evaluated in a judicious and balanced fashion. A [...]
perusal, for example, uncovers such nuggets as the highly
socialized, self-critical man is most likely to have the highest cholesterol
level (p. 242), and men who undergo vasectomies have high scores
for Socialization (p. 243).

Still another attractive feature is the detailed presentation of
individual profile interpretations. Beginning with instructions for
determining profile validity, Megargee describes scale
interpretation and pattern analysis, illustrating his comments with actual
case histories and profiles. He ends the discussion with a proper 
caution against over-interpretation and an acknowledgment of the 
contribution of purely actuarial approaches to test interpretation. 
Once again, this section of the book should be quite useful to prac- 
titioners in a variety of settings (e.g., counseling centers, proba- 
tion departments, and guidance clinics). 

The reviewer was particularly impressed with the manner in 
which Megargee discusses the “response-set problem.” Although 
he presents correlations between CPI scales and measures of social 
desirability, he recognizes (a) that the CPI Good Impression scale 
was one of the earliest measures of social desirability to appear; 
and (b) that such variance is more often valid than artifactual. 
In addition, Megargee treats the topic of acquiescence response 
set with the benign neglect that it appears to deserve. 

From a history of science perspective, perhaps the most important
contribution of this book is to dispel some of the persistent
myths that have haunted the CPI since its inception. For
example, because the usual goal of personality inventories is trait
specification (with the result that correlations with non-test
criteria are considered “peripheral”), the CPI has been roundly
criticized for factorial impurity and heterogeneity of its scales. As
Megargee points out, however, the CPI should be evaluated in 
terms of how well it achieves the goals that were originally set for 
it, rather than in terms of a reviewer's aesthetic predilections. The 
purpose of the CPI is to predict what an individual will do in a 
specified context and/or to forecast how he will be described by 
those who know him well. Consequently, scale homogeneity and 
factorial independence are relevant evaluative criteria for the 
CPI only if they can be shown to improve its predictive utility. 
Megargee also puts to rest the criticism that the CPI lacks any
theoretical underpinnings. On this point he allows Gough to speak
for himself:


Because the instrument is intended for the diagnosis and com- 
prehension of interpersonal behavior, the concepts selected are 
those that occur in everyday social living, and, in fact, arise 
from social interaction. Most simply, such variables may be 
described as ‘folk concepts’—aspects and attributes of inter- 
personal behavior that are to be found in all cultures and
societies, and that possess a direct and integral relationship to all
forms of social interaction (p. 12).


A third persistent myth is that Gough doesn’t understand the
factor structure of his inventory, a fact reflected by the manner in
which the scales are grouped on the CPI profile sheet. As Megargee
observes, however, Gough arranged his scales to facilitate clinical
interpretations of profiles rather than to reproduce psychometric
factors or clusters. 

There are some points in the book that I would argue with, only 
one of which need be mentioned; i.e., Megargee asserts without 
documentation that “In recent years, research has been performed 
using Negro, Mexican-American, and American Indian subjects. 
The results of these studies are disquieting because they show that 
lower class minority-group members often obtain lower scores on 
most CPI scales” (p. 249). The reviewer’s own data on this subject 
do not support Megargee’s statement. 

In summary, this is a scholarly, well-written, and exceedingly 
helpful introduction to the CPI, a valuable and authoritative ref- 
erence source for both practitioners and students, and an important 
contribution to the literature of personality assessment. It is also 
the most definitive study of the CPI available and as such should 
become required reading for students in a variety of applied areas 
of psychology. 

ROBERT HOGAN 
The Johns Hopkins University 


Stanley A. Mulaik. The Foundations of Factor Analysis, New 
York: McGraw-Hill, 1972. Pp. xvi + 453. $14.95. 


This monograph joins a small handful of books on technical 
issues in modern linear factor analysis. Mulaik has carefully 
dredged the factor analytic literature of the 1960’s, as well as
that of certain allied fields, to produce a sound and scholarly 
book which should serve his purposes well. As its title suggests, 
the aim of The Foundations of Factor Analysis “is to provide the 
reader with the mathematical rationale for factor-analytic procedures”
(p. xiii). While it is claimed that the book was designed
“as a text for (graduate) students of the behavioral or
social sciences” (p. xiii), much of the book will be out-of-bounds for
many students who lack a prerequisite working knowledge of
vectors, matrices and differential calculus. 

Chapters 1-4 were contrived to prepare the reader for the remainder
of the book. The history portion is brief but pointed;
the vector and matrix algebra is also brief and there are no
exercises; the section on calculus, too, serves primarily to orient
the reader to his particular notational system; an over[...] find
maxima and minima. [...] sparse, but sound, treatment [...]
Image analytic ideas are also outlined.
In Chapter 5 are found the basic equations of component and 
factor analysis, showing how vector-variables and total variance
are decomposed in these two systems. Chapter 6 details various
numerical methods for factoring square, symmetric matrices;
included are Cholesky, centroid, and a variety of principal-axis
methods. Givens-Householder methods for tri-diagonalization 
solutions for principal axes are briefly described, as are some 
modifications by Francis, Ortega, Kaiser and Wilkinson. Mulaik 
provides enough background in modern numerical analysis to 
drive home the point that round-off errors and computer effi- 
ciency must be considered in work-a-day applications of matrix 
decomposition theorems. 

Chapter 7 is pivotal for it is here that Mulaik undertakes a 
thorough mathematical examination of modern methods for common
factor analysis. The author begins in a neo-Thurstonian tradition,
basing his developments on population correlation matrices;
much of Guttman’s classic work on communality bounds is 
presented; the chapter ends with a long section on fitting the 
common factor model with emphasis on least-squares and maxi- 
mum likelihood methods as they had been developed up to 1969. 
A close reading will uncover problems which may confuse: for 
instance, terms such as “positive definite” and “semidefinite”
(p. 188) as well as symbols for a matrix trace (p. 151) are first
used here with no prior definition. Moreover, it is not always
clear as to whether population or sample matrices are intended
(pp. 156-59). That both the Fletcher-Powell and the
Newton-Raphson numerical procedures are described may be seen as a
bonus in comparing this book with its competitors. 

Chapter 8 is at least as useful as its predecessor. Fortunately, 
Mulaik has seen fit to make clear distinctions and to elaborate 
relationships among weighted and unweighted component analysis, 
Harris’ brand of image analysis, image factor analysis as well as 
canonical and alpha factor analysis. Careful reading of this chapter
should repay the effort, for the factoring specialist as well as the
applied researcher. When opinions about applications are given
they are typically reasonable. Nevertheless, some persons may
argue that it is an unsound “rule of thumb” to retain “only enough
principal components to account for, say, 95% of the total variance”
(p. 176) (our italics) for a principal-axis factoring of R.
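
The rule of thumb quoted above can be made concrete with a minimal sketch. The eigenvalues below are invented for illustration (they sum to 6, as the eigenvalues of a correlation matrix R of six variables would), and the function name is hypothetical rather than anything taken from Mulaik's text.

```python
def components_for_share(eigenvalues, share=0.95):
    """Smallest number of leading components whose eigenvalues
    add up to at least `share` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= share * total:
            return count
    return len(ordered)


# Invented eigenvalues of a 6 x 6 correlation matrix (assumption, not data
# from the book under review).
example_eigenvalues = [3.0, 1.4, 0.8, 0.4, 0.25, 0.15]
print(components_for_share(example_eigenvalues))
```

With these illustrative values the rule retains five of the six components, which suggests why some readers might regard a 95% threshold as a weak guide to the number of factors.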

Chapters 9-11 deal with factor transformations. Simple structure
is discussed, à la Thurstone. Graphical methods for generating
oblique solutions are first outlined, then elaborated in a step-[...]
equamax, parsimax, oblimax, oblimin, biquartimin, maxplane, etc.
Fortunately, the most pro[...] for oblique solutions are also
presented. While the methods themselves may give the appearance
of witchcraft, Mulaik’s writing is nicely balanced, exhibiting
substantial homework.

Generating solutions to fit preconceived structural hypotheses 
is examined first in Chapter 12, Procrustean Transformations. 
In Chapter 15 (which may have been misplaced), the second 
edition on Confirmatory Factor Analysis is found. Taken together, 
these chapters constitute most unusual as well as most needed 
innovations vis-a-vis the contents of other extant books on factor 
analysis. Whether or not the specific procedures of these chapters 
become commonplace, the future of factor analytic applications is 
undoubtedly going to be forcefully influenced by the availability 
of these kinds of methods. Again, careful reading is likely to pay 
off, perhaps especially for Jöreskog’s work on confirmatory
methodology.

The subject of Chapter 13 is Factor Scores, including a much- 
needed study of the problem of factorial indeterminacy. Although 
Psychometrika continues to be laced with papers on this topic, 
Mulaik’s development is illuminating. 

Chapter 14 on Factorial Invariance is good to have included, 
but the issues here are, to us, among the most unsettling of all 
those in the book. At this stage in the history of factor analysis,
where nearly all solutions are best regarded as largely arbitrary,
we need much more than aseptic equations for comparing factors
when considering “invariance.” One useful approach to this general
issue which was not yet available when this book was written
is that of Karl Jöreskog on “Simultaneous factor analysis in several
populations” (Psychometrika, 36, 409-426).

The final chapter (15) is entitled “Factor Analysis and Multi- 
variate Analysis.” This is a useful chapter since investigators often 
need to consider factoring methods in relation to other kinds of
multivariate procedures. Regression systems for studying criterion
sets in relation to predictor sets are contrasted with factor analysis;
separate transformations for predictors and criteria, stepwise
regression, canonical correlation and discriminant function analysis
are included. Together with, say, the final chapters of W. W.
Rozeboom’s (1966) Foundations of the Theory of Prediction, this chapter
ought to be required reading for behavioral science methodologists. 

It should be clear that we regard Mulaik’s work highly.
Mathematically sophisticated students will be especially edified. This
book is as apt to facilitate reading of much of the current issues
of Psychometrika as any other text we know.

We would hope to see several modifications in a second edition:
exercises and more numerical examples ought to be added; as
it is, large portions of this text will seem austere unless the
reader is quite familiar with empirical factor analytic studies.
More attention might also be given to the role of the computer
in general multivariate research. The immense power of computers 
now permits interactive analyses, graphical and tabular displays, 
as well as detailed examination of “lack of fit” information. 
Mulaik has provided groundwork for these kinds of data analytic 
questions, but as it now stands, the applied researcher has too 
little guidance for these most serious matters. 


ROBERT M. PRUZEK
State University of New York
at Albany


Bruce W. Tuckman. Conducting Educational Research. New York:
Harcourt Brace Jovanovich, 1972. Pp. xiii + 402. $8.95.


This new addition to the pool of textbooks for educational
research design courses reflects the sensitivity to experimental 
validity that was generated by Campbell and Stanley (1963). 
Tuckman devotes almost the entire book to the topics which 
are crucial to the formulation of an accurate and complete con- 
cept of validity. The organization of the content also enhances 
the emphasis of the book. 

Tuckman begins by briefly discussing the role of research and 
the selection of a problem. An indication of the importance he 
gives to validity in research is that internal and external validity 
are first mentioned on the third page of the text. These concepts 
serve to unify much of the remainder of the book. Tuckman is 
careful to make clear the distinction between results which are 
valid in a local situation and results which may be valid in more 
global settings. 


He then begins consideration of variables, with two chapters
[...] second, constructing operational definitions of variables.
Tuckman explicitly distinguishes among five kinds of variables:
independent, dependent, [...] the specification of each [...]
numerous carefully chosen examples taken from the research
literature and from student reports. The [...] of each class is
clear, and the comparisons among the variables are quite helpful
in [...] adequate understanding of each.

The next two chapters are devoted to the manipulation and 
control of variables and to research design. Threats to internal 
and external validity are discussed, with special attention given
to experimental procedures which can be used to control for
selection, history, and instrumentation. Research designs are
grouped according to the following categories: pre-experimental, 
true experimental, factorial, quasi-experimental, and ex post facto. 
A final section covers three special kinds of designs. Again, 
effective use is made of selected examples. 

Three complete research reports are included as appendices 
to the book. These reports are first discussed in the chapter on 
definition of variables, but more extensive discussion is made in the 
context of experimental control. Further references are made to 
these studies throughout the book, thus lending additional continuity 
to the exposition. Tuckman is not bound by these three studies, 
however. Many more examples are abstracted for their particular 
usefulness in illustrating specific points. 

The next two chapters deal with observation and measurement 
and with questionnaires and interview schedules. Construction 
of instruments is emphasized, with partial and complete examples 
of measurement scales and questionnaires given as illustrations. 
Tuckman’s discussion is straightforward, with attention given to 
clarifying the examples rather than merely describing them. 

The concluding four chapters, each relatively brief, are devoted 
to statistical analysis, data processing, research reporting, and 
conducting evaluation studies. The statistical analysis chapter is 
terse and would probably need to be supplemented by other 
references. The research reporting chapter is geared to journal 
writing with the emphasis placed on clarity and unambiguous- 
ness of writing style. Again, many well-chosen examples are in- 
cluded in the text. 

Tuckman has written this text to reflect three premises: (1) 
Research is a useful tool essential for uncovering causal relation- 
ships. (2) Research, much of it done in the field, must identify 
relationships between variables. (3) Although field research can- 
not be perfectly controlled, much of the “noise” can be eliminated 
through existing techniques. Each topic discussed in the book speaks 
to at least one aspect of these premises. 

Tuckman does use a compact writing style; he does not overburden
the inherent simplicity of the concepts with unnecessary
verbiage. He relies heavily on examples as explanations of the
concepts. This approach serves two purposes. First, the examples 
provide numerous opportunities for class discussions; students 
could be challenged to go beyond the direct explanations Tuckman 
gives. Second, the orientation toward examples provides an
inductive approach to learning the fundamentals of research
design. Hence, it is possible that students’ interpretations of the
concepts would be somewhat varied, and discussion might be
needed to clarify these interpretations. Too, the inductive
approach allows many possible uses of the text in a research design
course. Effective use of the text might depend in part on the
particular instructor. 

Tuckman has in the opinion of the reviewer emphasized the 
most important aspect of research: control. He has avoided 
almost entirely the discussion of mechanical details that tend to 
divert the attentions of the novice researcher from the primary 
goal of obtaining reliable information. The emphasis throughout 
is on conceptualization. This theoretical approach ought to pro- 
vide a very good base both for those who must perform and for 
those who must interpret educational research. 

Tuckman reveals a substantial bias toward experimental re- 
search. For those people whose primary interest is in historical 
or survey research, this text will probably not be as useful as 
other available books. In the opinion of the reviewer, however, 
Conducting Educational Research is among the best of the 
available books for an introductory course in research design, 
provided that the students have some familiarity with basic 
statistical procedures. It is a refreshing departure from the typical 
book which gives only passing attention to the problems of re- 
search design. The author is a very effective spokesman for the 
experimental emphasis of current educational research.


John P. Van de Geer. Introduction to Multivariate Analysis for
the Social Sciences. San Francisco, Calif.: W. H. Freeman,
1971. Pp. xi + 293. $9.50.


The book by Van de Geer is one of several that have come 
out recently, designed to provide graduate students in the
behavioral and social sciences with an introduction to multivariate
analysis. Prerequisite skills to handle the book are not spelled
out, but they evidently include some training in statistics. The
necessary matrix algebra and calculus are developed in the first
part of the book. The second part includes such topics as
regression and path analysis where there is only one dependent
variable, multiple and partial correlation, factor analysis,
canonical correlation, linear structural equations and discriminant
analysis.


Though matrix algebra and calculus take up seven chapters or
a third of the book, some of the more important topics in these
areas receive little or no attention. For instance, there is no 
discussion of definite matrices, nor any discussion of either the 
general eigenequation |A − λB| = 0, or extrema of the ratio of
quadratic forms. Instead the author repeatedly derives from first 
principles the eigenequation associated with the ratio of quadratic 
forms in later chapters. Furthermore several important theorems 
are stated imprecisely and the proofs are misleading if not erroneous. 
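
The connection the reviewers have in mind can be illustrated with a small sketch: for symmetric A and positive definite B, the extrema of the ratio x'Ax / x'Bx are the smallest and largest roots of |A − λB| = 0. The 2 x 2 matrices and function names below are assumptions chosen for illustration and do not come from Van de Geer's text; the 2 x 2 case is used only because its determinant is a quadratic that can be solved in closed form.

```python
import math


def generalized_eigenvalues_2x2(A, B):
    """Roots of det(A - lam*B) = 0 for symmetric 2x2 A and
    positive definite 2x2 B, returned as (smaller, larger)."""
    (a11, a12), (_, a22) = A
    (b11, b12), (_, b22) = B
    # det(A - lam*B) expands to qa*lam**2 + qb*lam + qc
    qa = b11 * b22 - b12 ** 2
    qb = -(a11 * b22 + a22 * b11 - 2.0 * a12 * b12)
    qc = a11 * a22 - a12 ** 2
    root = math.sqrt(max(qb ** 2 - 4.0 * qa * qc, 0.0))
    return ((-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa))


def ratio(A, B, x):
    """Ratio of quadratic forms x'Ax / x'Bx for a 2-vector x."""
    def quad(M):
        return (M[0][0] * x[0] ** 2
                + 2.0 * M[0][1] * x[0] * x[1]
                + M[1][1] * x[1] ** 2)
    return quad(A) / quad(B)


# Invented matrices (assumption): A symmetric, B positive definite.
A = [[2.0, 1.0], [1.0, 3.0]]
B = [[2.0, 0.5], [0.5, 1.0]]

low, high = generalized_eigenvalues_2x2(A, B)
# Scan directions over half the unit circle; the ratio ignores the sign of x.
angles = (math.pi * k / 1800 for k in range(1800))
ratios = [ratio(A, B, (math.cos(t), math.sin(t))) for t in angles]

print(low, high)
print(min(ratios), max(ratios))
```

The scanned minimum and maximum of the ratio should agree closely with the two roots, which is the general result the reviewers would have liked to see stated once rather than rederived in each later chapter.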

Chapter 8 deals with the multivariate normal distribution. 
Straightforward as it is, the author’s notation differs from the
conventional, and at times is inconsistent. For example, the author 
does not attempt to distinguish between parameters and their 
sample estimates notationally, and this leads to conceptual dif- 
ficulties in later chapters. 

An overview of multivariate analysis is attempted in chapter 
9. In the introduction, the author tries to explain some basic 
statistical concepts. The notion of observable vs. unobservable 
variables is introduced but at times it is not clear what is meant. 
It is obvious that the author’s “unobserved variable” corresponds 
to “latent variable” but it is defined as “what we mean sure”! 
No attempt is made to define or explain the concept of a random 
variable and often it is not clear whether the author is referring 
to mathematical or random variables. For instance, it is not 
clear whether the systematic component frequently alluded to is 
a random variable since the author explains the error term as 
“. . . e_i could be a random error variable, or a specific systematic
component . . . .” The reviewers feel that understanding the
underlying statistical model is crucial in multivariate analysis, 
and the notion of random variable is indispensable to this under- 
standing. The author proceeds to present overviews of multivar- 
iate analysis pictorially and in terms of “matrix operations,” both 
of which appear highly artificial and vacuous.

Chapters 10 and 11 deal with regression and path analysis, 
and here again, the language is bound to cause confusion. As 
an example, consider the statement “. . . we might draw a line
that runs, roughly, through the averages of x_2 for given observed
values of x_1. If this line is a straight line, it is also called a
regression line.” Another instance is, “Regression theory can be
further generalized to include problems where one variable
depends on several fixed variables, but this falls outside our scope.”
This is unclear since the author does not explain the meaning
of fixed variables and moreover derives the estimates of the
parameters for the general linear model at the end of the chapter.
In spite of his stand against introducing distribution theory, and
without defining the concept of a conditional distribution, the
author gives an interpretation of the regression line in terms of
conditional distributions, and once having done so fails to exploit
it for the case of several variables. 

Factor analysis is given more attention than the other multi- 
variate procedures. Two chapters deal with the exposition of 
fundamental concepts and some of the multitudinous guises of 
factor analysis. The author’s stand on factor analysis is clear 
from chapter 1 where he defines factor analysis as a procedure for 
finding “a linear compound of the observed scores that has maximum
variance.” To make matters worse, the author considers
factor analysis as a special case of principal components analysis 
in the sense that factor analysis is “principal-components analysis 
applied to standardized variables”! This statement is clearly 
erroneous. The author fails to convey the important distinction 
between the two procedures: that principal-components analysis 
is merely an algebraic transformation rather than the result of 
a fundamental statistical model. Moreover, the factor model is 
written as 


\[ x_i = f_{i1} y_1 + \cdots + f_{im} y_m + e_i, \qquad i = 1, \ldots, m, \]


where x_i is the (n × 1) vector of observations on the ith variable,
and y_j is the (n × 1) vector of factor scores on factor j.
Apparently there are m variables to be “explained” in terms of m
factors. The model for the correlation structure is derived in 
terms of the sample correlation matrix and not the population 
correlation matrix. However, in the discussion of the maximum 
likelihood estimation procedure the author emphasizes this
distinction and this could result in confusion. Furthermore, important
topics such as interpretation of vectors, rotation, and
estimation of factor scores are glossed over very rapidly and
carelessly.

Canonical, alpha-factor analyses, and what the author calls 
canonical discriminant factor analysis are discussed in the chapter 
entitled “Varieties of Factor Analysis.” Canonical discriminant 
factor analysis is merely discriminant analysis applied to several 
groups (Rao, 1952), but as the author does not refer to the work 
of Rao explicitly, it may be difficult for the students to identify
it with varieties of factor analysis discussed in available texts. 
The property of scale invariance is presented as a property of 
the model and not as a consequence of the estimation procedure. 
The author fails to note that in general procedures which reduce 
to finding the extrema of ratios of quadratic forms are scale 
invariant. This is a weakness since it is the express purpose 
of the author to provide a unified treatment of the various
procedures. No reference is made to the contributions of McKeon
or McDonald (1968) in this area. Maximum likelihood
estimation procedure is grouped with these varieties of factor
analysis. Though the maximum likelihood estimation procedure
and Rao’s formulation of factor analysis lead to the same
estimates, the procedures are not comparable and distinctions
such as these are not made clear. It is unfortunate that the 
recent and important advances in the computational procedures 
for solving the likelihood equations are not acknowledged. In- 
stead, an early and now obsolete procedure for solving these 
equations is outlined. 
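
The reviewers' remark on scale invariance can be checked numerically in a toy case. Rescaling the variables by a diagonal matrix D turns A into DAD and B into DBD, and the roots of |A − λB| = 0, which are the stationary values of the ratio of quadratic forms, are unchanged. The matrices, the scale factors, and the function names below are assumptions for illustration only; this is a sketch of the algebraic point, not of any estimation procedure in the book.

```python
import math


def det_roots(A, B):
    """Both roots of det(A - lam*B) = 0 for symmetric 2x2 A and
    positive definite 2x2 B, returned as (smaller, larger)."""
    qa = B[0][0] * B[1][1] - B[0][1] ** 2
    qb = -(A[0][0] * B[1][1] + A[1][1] * B[0][0] - 2.0 * A[0][1] * B[0][1])
    qc = A[0][0] * A[1][1] - A[0][1] ** 2
    root = math.sqrt(max(qb ** 2 - 4.0 * qa * qc, 0.0))
    return ((-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa))


def rescale(M, d):
    """Return D M D for D = diag(d): the matrix obtained after
    multiplying variable i by the constant d[i]."""
    return [[d[i] * M[i][j] * d[j] for j in range(2)] for i in range(2)]


# Invented matrices and scale factors (assumptions for illustration).
A = [[2.0, 1.0], [1.0, 3.0]]
B = [[2.0, 0.5], [0.5, 1.0]]
d = (10.0, 0.25)

original = det_roots(A, B)
rescaled = det_roots(rescale(A, d), rescale(B, d))
print(original)
print(rescaled)
```

Because det(DAD − λDBD) = det(D)² det(A − λB), the two printed pairs should coincide up to rounding, whatever nonzero scale factors are chosen.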

The rest of the book deals with path analysis, canonical cor- 
relations, general linear structural equation systems and dis- 
criminant analysis. Most of these topics are treated adequately, 
and the chapter on discriminant analysis is particularly good. The 
book concludes with an appendix on the computational procedure 
for obtaining the eigen values and eigen vectors of a matrix. The 
reason for including this appendix remains a mystery—it does 
not serve any purpose and the steps outlined seem to contain 
errors. 

There are several flaws in the book that limit its usefulness. The 
most obvious is the lack of references for the reader who wants to 
pursue the topics in greater detail. The author attempts to talk 
to readers at different levels simultaneously, mentioning for ex- 
ample conditional distributions and expected values without any 
preparation. A third flaw is the author's use of language: the 
language is confusing, at times misleading, and the author’s
excessive use of logical connectives where they are inappropriate
is annoying. The author has not felt it necessary to include 
the general linear (multivariate) hypothesis, nor any tests of 
significance. The reason given for not including MANOVA is 
the existence of excellent texts on the subject, but this is more 
nearly true of factor analysis and most other techniques which 
are covered. There are not very many texts that introduce stu- 
dents to MANOVA and which explain how univariate procedures 
can be extended to the multivariate situation. Moreover most
researchers in the social sciences do not seem to be familiar with
multivariate [...]. The last and the most serious criticism is
that nearly all the multivariate procedures included are presented
simply as [...] for data reduction with little or no emphasis
on their statistical nature, and hence, it is not possible to acquire
an understanding of the techniques discussed.

The author’s attempt to bring together data reduction techniques
from different disciplines and to provide a unified treatment of
these procedures is highly commendable. Since the
book has serious defects it is suggested, in the spirit of Ralph
Nader, that it be recalled and reissued after some revision.